Model Summary

The language model phi-1 is a Transformer with 1.3 billion parameters, specialized for basic Python coding. Its training drew on a variety of data sources, including subsets of Python code from The Stack v1.2, Q&A content from StackOverflow, competition code from code_contests, and synthetic Python textbooks and exercises generated by gpt-3.5-turbo-0301. Even though the model and the datasets are relatively small compared to contemporary Large Language Models (LLMs), phi-1 achieves more than 50% accuracy on the simple Python coding benchmark HumanEval.

Intended Uses

Given the nature of the training data, phi-1 is best suited for prompts using the code format:

code format:

def print_prime(n):
   """
   Print all primes between 1 and n
   """
   for num in range(2, n+1):
       for i in range(2, num):
           if num % i == 0:
               break
       else:
           print(num)

where the model generates the code after the comments. (Note: the for ... else above is a legitimate and correct Python construct; the else block runs only when the loop completes without hitting break, i.e., when no divisor of num is found and num is therefore prime.)
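
As a standalone illustration of that for/else behavior, checking a single number:

# The else block of a for loop runs only if the loop was not ended by break.
for i in range(2, 7):
    if 7 % i == 0:
        break
else:
    print("7 is prime")  # printed, because no i in 2..6 divides 7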

Notes

Limitations of phi-1

Warning about Security Risks

When leveraging phi-1, it is important to remain vigilant: the model, though powerful, can inadvertently introduce security vulnerabilities into the generated code.

Given these potential pitfalls, and others not explicitly mentioned, it's essential to thoroughly review, test, and verify the generated code before deploying it in any application, especially those that are security-sensitive. Always consult with security experts or perform rigorous penetration testing when in doubt.
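
For instance, one lightweight check is to execute the generated code in an isolated namespace and spot-check it against known inputs before any deployment. The sketch below assumes a hypothetical generated_code string rather than actual phi-1 output:

# Minimal sketch of sanity-checking generated code; generated_code is a
# hypothetical placeholder, not actual phi-1 output.
generated_code = '''
def is_prime(num):
    """Return True if num is prime."""
    if num < 2:
        return False
    for i in range(2, int(num ** 0.5) + 1):
        if num % i == 0:
            return False
    return True
'''

namespace = {}
exec(generated_code, namespace)  # execute only in a reviewed, sandboxed environment

# Spot-check against known values before considering the code for deployment.
assert namespace["is_prime"](7) is True
assert namespace["is_prime"](8) is False
assert namespace["is_prime"](1) is False
print("basic checks passed")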

Training

Model

Software

License

The model is licensed under the Research License.

Sample Code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Run on the GPU by default
torch.set_default_device("cuda")

# Load the phi-1 weights and tokenizer (trust_remote_code loads the model's custom implementation)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=True)

# Prompt with a function signature and docstring; phi-1 completes the body
inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

# Generate up to 200 tokens and decode the completion
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

If you need to use the model in a lower precision (e.g., FP16), please wrap the model's forward pass with torch.autocast(), as follows:

with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    outputs = model.generate(**inputs, max_length=200)

Remark. In the generation function, our model currently does not support beam search (num_beams > 1). Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings (instead of the model's).
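
Greedy decoding (the default) works as shown above; if more varied completions are desired, the standard sampling arguments to generate() can be used in place of beam search. A sketch, reusing model, inputs, and tokenizer from the sample code and assuming the usual Hugging Face sampling options (do_sample, temperature, top_p) behave as expected with this model:

# Sampling-based decoding as an alternative to beam search.
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,   # enable sampling instead of greedy decoding
    temperature=0.2,  # lower values keep completions close to greedy
    top_p=0.95,       # nucleus sampling
)
print(tokenizer.batch_decode(outputs)[0])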

Citation

@article{gunasekar2023textbooks,
  title={Textbooks Are All You Need},
  author={Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio C{\'e}sar Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and others},
  journal={arXiv preprint arXiv:2306.11644},
  year={2023}
}