<p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/pukpRSNaPbWiSKhuiGF8R.jpeg" alt="banner"> </p>
<p align="center"> ๐ค <a href="https://www.linkedin.com/in/harryroy/" target="_blank">About me</a> โข๐ฑ <a href="https://www.harry.vc/" target="_blank">Harry.vc</a> โข ๐ฆ <a href="https://X.com/" target="_blank">X.com</a> โข ๐ <a href="https://arxiv.org/" target="_blank">Papers</a> <br> </p>
# 🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction
This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.
## News
- 🔥🔥🔥 [2023/10/06] Building a new, significantly improved dataset and fixing stop tokens.
- 🔥🔥🔥 [2023/10/05] NinjaMasker-PII-Redaction version 1 was released.
## Model Details
### 📝 Model Description
This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.
- Developed by: Harry Roy McLaughlin
- Model type: Fine-tuned Language Model
- Language(s) (NLP): English
- License: TBD
- Finetuned from model: NousResearch/Llama-2-7b-chat-hf
### 🌱 Model Sources
- Repository: Hosted on HuggingFace
- Demo: Coming soon
## 🧪 Test the model
Log into HuggingFace (if not already):

```python
!pip install transformers

from huggingface_hub import notebook_login

notebook_login()
```
Load the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (the token from notebook_login is picked up automatically)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Generate text:

```python
# Build a text-generation pipeline and wrap the prompt in the Llama-2 chat format
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```
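The pipeline output echoes the prompt before the completion. A minimal sketch to isolate the model's answer, assuming the redacted text is everything after the closing `[/INST]` tag:

```python
# The generated text includes the prompt; keep only the completion.
full_text = result[0]["generated_text"]
redacted = full_text.split("[/INST]", 1)[-1].strip()
print(redacted)
```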
## Uses
### 🎯 Direct Use
The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
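As an illustration, a call transcript could be redacted line by line with the pipeline built in "Test the model" above. The helper below is a hypothetical sketch, not part of the model's API:

```python
def redact_transcript(lines, pipe):
    """Redact PII line by line; `pipe` is the text-generation pipeline from above.

    Hypothetical convenience wrapper for illustration only.
    """
    redacted_lines = []
    for line in lines:
        output = pipe(f"<s>[INST] {line} [/INST]")[0]["generated_text"]
        # Keep only the completion after the [/INST] tag.
        redacted_lines.append(output.split("[/INST]", 1)[-1].strip())
    return redacted_lines

transcript = [
    "Agent: Can I get your name and number?",
    "Caller: Sure, it's Harry, 204 555 4355.",
]
print("\n".join(redact_transcript(transcript, pipe)))
```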
### ⬇️ Downstream Use
The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.
### ❌ Out-of-Scope Use
The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.
## ⚠️ Bias, Risks, and Limitations
The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.
### 📌 Recommendations
Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
## 🏋️ Training Details
### 📊 Training Data
The model was trained on a dataset of 43,000 question/answer pairs that contained various forms of PII. There are 63 labels that the model looks for.
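For illustration, a question/answer pair plausibly has the shape below. The bracketed label names ([FIRST_NAME], [PHONE_NUMBER]) are hypothetical stand-ins; the full list of 63 labels is not reproduced here:

```python
# Hypothetical training pair; the bracketed label names are assumptions.
example = {
    "question": "Hi, this is Harry calling from 204 555 4355.",
    "answer": "Hi, this is [FIRST_NAME] calling from [PHONE_NUMBER].",
}
```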
### ⚙️ Training Hyperparameters
- Training regime: FP16 (see the loading sketch below)
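A minimal loading sketch consistent with that regime (half-precision inference is an assumption here, not a stated requirement):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the fine-tuned weights in half precision to reduce memory use,
# mirroring the FP16 training regime noted above.
model = AutoModelForCausalLM.from_pretrained(
    "King-Harry/NinjaMasker-PII-Redaction",
    torch_dtype=torch.float16,
)
```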
### 🚀 Speeds, Sizes, Times
- Hardware: T4 GPU
- Cloud Provider: Google Colab Pro (for the extra RAM)
- Training Duration: ~4 hours
## 📈 Evaluation
Evaluation is pending.
## 🌍 Environmental Impact
Given the significant computing resources used, the model likely has a substantial carbon footprint. Exact calculations are pending.
- Hardware Type: T4 GPU
- Hours used: ~4
- Cloud Provider: Google Colab Pro
## 📐 Technical Specifications
### 🏗️ Model Architecture and Objective
The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
### 🖥️ Hardware
- Training Hardware: T4 GPU (with extra RAM)
### 💾 Software
- Environment: Google Colab Pro
## 💪 Disclaimer

This model is in its first generation and will be updated rapidly.
## ✍️ Model Card Authors

Harry Roy McLaughlin
## 📞 Model Card Contact
harry.roy@gmail.com