StarPII

Model description

This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned bigcode-encoder on a PII dataset we annotated, available with gated access at bigcode-pii-dataset (see bigcode-pii-dataset-training for the exact data splits). We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses, and Usernames.
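For reference, here is a minimal inference sketch using the transformers pipeline API. The bigcode/starpii model ID and the example snippet are assumptions for illustration; adjust them to your setup.

```python
from transformers import pipeline

# Load the token-classification pipeline; "bigcode/starpii" is an assumed
# Hugging Face Hub ID based on the BigCode organization's naming.
pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

code_snippet = 'smtp.login("alice@example.com", password="hunter2")'
for entity in pii_detector(code_snippet):
    print(entity["entity_group"], repr(entity["word"]), round(entity["score"], 3))
```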

Dataset

Fine-tuning on the annotated dataset

The fine-tuning dataset contains 20,961 secrets across 31 programming languages, while the base encoder model was pre-trained on 88 programming languages from The Stack dataset.

Initial training on a pseudo-labelled dataset

To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset. The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data.

Specifically, we annotated 18,000 files, available at bigcode-pii-pseudo-labeled, using an ensemble of two encoder models, Deberta-v3-large and stanford-deidentifier-base, which were fine-tuned on an internal, previously labeled PII dataset for code with 400 files from this work. To select good-quality pseudo-labels, we averaged the probability scores of the two models and filtered based on a minimum score. After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only the entities that had a trigger word such as key, auth, or pwd in the surrounding context (see the sketch below). Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories, as shown in the tables in the following section.
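The sketch below illustrates this filtering step. The 0.9 threshold, the trigger-word list, and the field names are assumptions for illustration, not the exact values used to build the released dataset.

```python
# Illustrative pseudo-label filtering: keep confident ensemble predictions,
# and require a trigger word in context for Keys and Passwords.
MIN_SCORE = 0.9  # assumed threshold, not the published value
TRIGGER_WORDS = ("key", "auth", "pwd")

def keep_pseudo_label(entity, context_window):
    """Return True if an ensemble prediction should be kept as a pseudo-label."""
    # Average the per-entity probabilities of the two ensemble models.
    avg_score = (entity["score_deberta"] + entity["score_deid"]) / 2
    if avg_score < MIN_SCORE:
        return False
    # Keys and Passwords had many false positives, so require a trigger word
    # in the surrounding code context.
    if entity["label"] in ("KEY", "PASSWORD"):
        context = context_window.lower()
        return any(word in context for word in TRIGGER_WORDS)
    return True
```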

Performance

This model is represented in the last row of each table below (NER + pseudo labels).

| Method | Email Prec. | Email Recall | Email F1 | IP Prec. | IP Recall | IP F1 | Key Prec. | Key Recall | Key F1 |
|---|---|---|---|---|---|---|---|---|---|
| Regex | 69.8% | 98.8% | 81.8% | 65.9% | 78% | 71.7% | 2.8% | 46.9% | 5.3% |
| NER | 94.01% | 98.10% | 96.01% | 88.95% | 94.43% | 91.61% | 60.37% | 53.38% | 56.66% |
| + pseudo labels | 97.73% | 98.94% | 98.15% | 90.10% | 93.86% | 91.94% | 62.38% | 80.81% | 70.41% |

| Method | Name Prec. | Name Recall | Name F1 | Username Prec. | Username Recall | Username F1 | Password Prec. | Password Recall | Password F1 |
|---|---|---|---|---|---|---|---|---|---|
| NER | 83.66% | 95.52% | 89.19% | 48.93% | 75.55% | 59.39% | 59.16% | 96.62% | 73.39% |
| + pseudo labels | 86.45% | 97.38% | 91.59% | 52.20% | 74.81% | 61.49% | 70.94% | 95.96% | 81.57% |
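Entity-level precision, recall, and F1 of this kind are commonly computed with the seqeval library over BIO-tagged sequences; the snippet below is a generic illustration with toy data, not the exact evaluation script used here.

```python
from seqeval.metrics import classification_report

# Gold and predicted BIO tags for two toy sequences; entity-level scoring
# counts a prediction as correct only if both span and type match.
y_true = [["O", "B-EMAIL", "I-EMAIL", "O"], ["B-KEY", "I-KEY", "O"]]
y_pred = [["O", "B-EMAIL", "I-EMAIL", "O"], ["O", "I-KEY", "O"]]

print(classification_report(y_true, y_pred))
```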

We used this model to mask PII in the BigCode large model training. We dropped usernames since they resulted in many false positives and false negatives. For the other PII types, we added post-processing steps that we recommend for future uses of the model; the code is available on GitHub, and a rough sketch is given below.
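The exact filters are defined in the GitHub code. As an illustration of the kind of post-processing involved, the sketch below drops very short detections and invalid or private IP addresses; the length threshold, label names, and helper structure are assumptions, not the published rules.

```python
import ipaddress

MIN_SECRET_LENGTH = 4  # illustrative threshold, not the published value

def postprocess(entities):
    """Filter raw model detections before masking (illustrative only)."""
    kept = []
    for ent in entities:
        text = ent["word"].strip()
        # Drop detections too short to be meaningful secrets.
        if len(text) < MIN_SECRET_LENGTH:
            continue
        # Drop IP addresses that are malformed or not internet-facing.
        # "IP_ADDRESS" is an assumed label name for this model's output.
        if ent["entity_group"] == "IP_ADDRESS":
            try:
                if ipaddress.ip_address(text).is_private:
                    continue
            except ValueError:
                continue
        kept.append(ent)
    return kept
```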

Considerations for Using the Model

When using this model, please be aware of the potential risks associated with its application. False positives and false negatives are possible, which could lead to unintended consequences when processing sensitive data. Moreover, the model's performance may vary across data types and programming languages, necessitating validation and fine-tuning for specific use cases. Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible, we aim to encourage the development of privacy-preserving AI technologies while remaining vigilant about the risks associated with PII.