# PII Redaction and Masking LLM (Llama 2)

<p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/pukpRSNaPbWiSKhuiGF8R.jpeg" alt="banner"> </p>

<p align="center"> 🤗 <a href="https://www.linkedin.com/in/harryroy/" target="_blank">About me</a> • 🐱 <a href="https://www.harry.vc/" target="_blank">Harry.vc</a> • 🐦 <a href="https://X.com/" target="_blank">X.com</a> • 📃 <a href="https://arxiv.org/" target="_blank">Papers</a> <br> </p>

## 🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.

## News

## Model Details

### 📖 Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

### 🌱 Model Sources

## 🧪 Test the model

### Log in to Hugging Face (if not already)

```python
!pip install transformers
from huggingface_hub import notebook_login

# Opens an interactive login widget in the notebook
notebook_login()
```
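If you're running a plain Python script rather than a notebook, `notebook_login()` won't work; as a minimal alternative, `huggingface_hub.login()` accepts an access token directly (the token value below is a placeholder):

```python
from huggingface_hub import login

# Authenticate with a Hugging Face access token instead of the
# interactive notebook widget; "hf_..." is a placeholder, not a real token.
login(token="hf_...")
```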

### Load Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Suppress non-critical warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (your Hugging Face login is used if the repo requires it)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

### Generate Text

```python
# Build a text-generation pipeline (max_length caps prompt + output tokens)
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)

# Wrap the input in the Llama 2 instruction format the model was fine-tuned on
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```
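By default, `generated_text` echoes the `[INST]`-wrapped prompt followed by the model's redacted output. If you only want the completion, the standard `transformers` pipeline option `return_full_text=False` drops the prompt echo:

```python
# Return only the model's completion, without echoing the prompt
result = pipe(f"<s>[INST] {prompt} [/INST]", return_full_text=False)
print(result[0]["generated_text"])
```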

## Uses

### 🎯 Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
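For transcript-style workloads, the pipeline above can be wrapped in a small helper that applies the instruction template to each utterance. This is an illustrative sketch, not part of the released code; it assumes `pipe` from the example above is already constructed:

```python
def redact(text: str, max_length: int = 256) -> str:
    """Redact PII from one piece of text (illustrative helper)."""
    output = pipe(
        f"<s>[INST] {text} [/INST]",
        max_length=max_length,
        return_full_text=False,
    )
    return output[0]["generated_text"].strip()

# Example: redact a single utterance from a call transcript
print(redact("Sure, my card number is 4111 1111 1111 1111."))
```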

### ⬇️ Downstream Use

The model has potential for numerous downstream applications, though specific use cases have yet to be fully explored.

โŒ Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

โš–๏ธ Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

๐Ÿ‘ Recommendations

Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.

๐Ÿ‹๏ธ Training Details

### 📊 Training Data

The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, and it detects and masks 63 distinct PII labels.
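Purely for illustration (the card doesn't publish the label vocabulary or the exact pair format, so the tag names below are assumptions), a training pair might look like:

```python
# Hypothetical question/answer training pair; the real 63-label
# vocabulary and formatting may differ.
example = {
    "question": "Hi, this is Sarah Jones calling from 204-555-0142.",
    "answer": "Hi, this is [FIRST_NAME] [LAST_NAME] calling from [PHONE_NUMBER].",
}
```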

โš™๏ธ Training Hyperparameters

### 🚀 Speeds, Sizes, Times

## 📋 Evaluation

Evaluation is pending.

๐ŸŒ Environmental Impact

Given the significant computing resources used, the model likely has a substantial carbon footprint. Exact calculations are pending.

## 📄 Technical Specifications

๐Ÿ›๏ธ Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
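A 7B-parameter model needs a fairly large GPU at full precision. As a sketch (assuming the `bitsandbytes` package is installed; this is not an official recommendation from the model card), the weights can be loaded in 4-bit to cut memory use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized load to fit the 7B model on a smaller GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "King-Harry/NinjaMasker-PII-Redaction",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("King-Harry/NinjaMasker-PII-Redaction")
```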

๐Ÿ–ฅ๏ธ Hardware

### 💾 Software

โœ๏ธ Model Card Authors

Harry Roy McLaughlin

## 📞 Model Card Contact

harry.roy@gmail.com