<p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/pukpRSNaPbWiSKhuiGF8R.jpeg" alt="banner"> </p>
<p align="center"> ๐ค <a href="https://www.linkedin.com/in/harryroy/" target="_blank">About me</a> โข๐ฑ <a href="https://www.harry.vc/" target="_blank">Harry.vc</a> โข ๐ฆ <a href="https://X.com/" target="_blank">X.com</a> โข ๐ <a href="https://arxiv.org/" target="_blank">Papers</a> <br> </p>
# 🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction
This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.
## News
- 🔥🔥🔥 [2023/10/06] Building a new, significantly improved dataset and fixing stop tokens.
- 🔥🔥🔥 [2023/10/05] NinjaMasker-PII-Redaction version 1 was released.
## Model Details
### 📝 Model Description
This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.
- Developed by: Harry Roy McLaughlin
- Model type: Fine-tuned Language Model
- Language(s) (NLP): English
- License: TBD
- Finetuned from model: NousResearch/Llama-2-7b-chat-hf
### 🌱 Model Sources
- Repository: Hosted on HuggingFace
- Demo: Coming soon
## 🧪 Test the model
Log into HuggingFace (if not already):

```python
!pip install transformers

from huggingface_hub import notebook_login

notebook_login()
```
Load the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (the token from notebook_login is picked up automatically)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Generate text:

```python
# Build a text-generation pipeline and wrap the prompt in the Llama-2 chat format
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```
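The pipeline output echoes the prompt before the completion. A minimal sketch to isolate the model's answer, assuming the redacted text is everything after the closing `[/INST]` tag:

```python
# The generated text includes the prompt; keep only the completion.
full_text = result[0]["generated_text"]
redacted = full_text.split("[/INST]", 1)[-1].strip()
print(redacted)
```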
## Uses
### 🎯 Direct Use
The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
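As an illustration, a call transcript could be redacted line by line with the pipeline built in "Test the model" above. The helper below is a hypothetical sketch, not part of the model's API:

```python
def redact_transcript(lines, pipe):
    """Redact PII line by line; `pipe` is the text-generation pipeline from above.

    Hypothetical convenience wrapper for illustration only.
    """
    redacted_lines = []
    for line in lines:
        output = pipe(f"<s>[INST] {line} [/INST]")[0]["generated_text"]
        # Keep only the completion after the [/INST] tag.
        redacted_lines.append(output.split("[/INST]", 1)[-1].strip())
    return redacted_lines

transcript = [
    "Agent: Can I get your name and number?",
    "Caller: Sure, it's Harry, 204 555 4355.",
]
print("\n".join(redact_transcript(transcript, pipe)))
```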
### ⬇️ Downstream Use
The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.
### ❌ Out-of-Scope Use
The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.
## ⚠️ Bias, Risks, and Limitations
The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.
### 📌 Recommendations
Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
## 🏋️ Training Details
### 📊 Training Data
The model was trained on a dataset of 43,000 question/answer pairs that contained various forms of PII. There are 63 labels that the model looks for.
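For illustration, a question/answer pair plausibly has the shape below. The bracketed label names ([FIRST_NAME], [PHONE_NUMBER]) are hypothetical stand-ins; the full list of 63 labels is not reproduced here:

```python
# Hypothetical training pair; the bracketed label names are assumptions.
example = {
    "question": "Hi, this is Harry calling from 204 555 4355.",
    "answer": "Hi, this is [FIRST_NAME] calling from [PHONE_NUMBER].",
}
```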
### ⚙️ Training Hyperparameters
- Training regime: FP16 (see the loading sketch below)
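A minimal loading sketch consistent with that regime (half-precision inference is an assumption here, not a stated requirement):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the fine-tuned weights in half precision to reduce memory use,
# mirroring the FP16 training regime noted above.
model = AutoModelForCausalLM.from_pretrained(
    "King-Harry/NinjaMasker-PII-Redaction",
    torch_dtype=torch.float16,
)
```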
### 🚀 Speeds, Sizes, Times
- Hardware: T4 GPU
- Cloud Provider: Google Colab Pro (for the extra RAM)
- Training Duration: ~4 hours
## 📈 Evaluation
Evaluation is pending.
## 🌍 Environmental Impact
Given the significant computing resources used, the model likely has a substantial carbon footprint. Exact calculations are pending.
- Hardware Type: T4 GPU
- Hours used: ~4
- Cloud Provider: Google Colab Pro
## 📐 Technical Specifications
### 🏗️ Model Architecture and Objective
The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
### 🖥️ Hardware
- Training Hardware: T4 GPU (with extra RAM)
### 💾 Software
- Environment: Google Colab Pro
## 💪 Disclaimer

This model is in its first generation and will be updated rapidly.
## ✍️ Model Card Authors

Harry Roy McLaughlin
## 📞 Model Card Contact
harry.roy@gmail.com