CodeGen-350M-multi-xlcost-v2
CodeGen-350M-multi-xlcost-v2 is a CodeGen model fine-tuned on the Python split of the XLCost dataset using DeepSpeed.
Usage
You can load the CodeGen-350M-multi-xlcost-v2 model and tokenizer directly with transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("giulio98/codegen-350M-multi-xlcost-v2")
model = AutoModelForCausalLM.from_pretrained("giulio98/codegen-350M-multi-xlcost-v2")

# Prompt format: EOS token, the task description wrapped in a docstring, then a "###" separator
text = tokenizer.eos_token + "'''\n" + "function to add two numbers" + "\n'''\n" + "###\n"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# max_length counts the prompt tokens plus the generated tokens
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Output:
'''
function to add two numbers
'''
###
def add(a, b):
    return a + b
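The prompt format used above can be wrapped in two small helpers. This is a minimal sketch: `build_prompt` and `extract_completion` are hypothetical names introduced here for illustration, not part of the model or of transformers, and the `"<|endoftext|>"` string in the usage lines only stands in for `tokenizer.eos_token`.

```python
def build_prompt(description: str, eos_token: str) -> str:
    """Wrap a task description in the docstring-plus-separator prompt shown above."""
    return f"{eos_token}'''\n{description}\n'''\n###\n"


def extract_completion(decoded: str, marker: str = "###\n") -> str:
    """Return the text after the first separator, or the full text if it is absent."""
    _, found, completion = decoded.partition(marker)
    return completion if found else decoded


# Stand-in for tokenizer.eos_token (assumption for this sketch)
prompt = build_prompt("function to add two numbers", "<|endoftext|>")

# Decoded output as printed in the example above (special tokens already skipped)
decoded = "'''\nfunction to add two numbers\n'''\n###\ndef add(a, b):\n    return a + b"
print(extract_completion(decoded))
```

Keeping the separator in one place makes it easier to strip the echoed prompt before running the generated function.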
Training
The model was fine-tuned on XLCost-single-prompt, an improved version of the original XLCost dataset xlcost-text-to-code. The hyperparameters are listed below.
Hyperparameter | Value |
---|---|
Per device train batch size | 16 |
Context size | 1024 |
Training steps | 259 |
Gradient accumulation | 2 |
Gradient checkpointing | True |
Learning rate | 1.8e-05 |
Weight decay | 0.1 |
Warmup steps | 35 |
Schedule | linear |
ZeRO stage | 2 |
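The hyperparameters in the table imply a rough training budget. The sketch below assumes that "Training steps" counts optimizer steps and that training used the single GPU mentioned in this card. The token figure is only an upper bound, since sequences may be shorter than the context size.

```python
# Values taken from the hyperparameter table
per_device_batch = 16
grad_accum = 2
train_steps = 259      # assumed to be optimizer steps
context_size = 1024
num_gpus = 1           # single V100, as stated in this card

# Sequences consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Total sequences processed during training
sequences_seen = effective_batch * train_steps

# Upper bound on tokens, reached only if every sequence fills the context
max_tokens_seen = sequences_seen * context_size

print(effective_batch, sequences_seen, max_tokens_seen)
```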
The DeepSpeed configuration is shown below:
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.000018,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.000018,
      "warmup_num_steps": 35
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 200000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 200000000,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 2,
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_clipping": 1,
  "wall_clock_breakdown": false
}
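DeepSpeed expects `train_batch_size` to equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the number of GPUs. The sketch below checks that relation for the three batch-related keys of the configuration above. `check_batch_config` is a hypothetical helper written for this example, not a DeepSpeed function.

```python
import json

# Only the batch-related keys from the configuration above
config_text = """
{
  "gradient_accumulation_steps": 2,
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 16
}
"""


def check_batch_config(cfg: dict, world_size: int) -> bool:
    """Return True if the global batch size matches micro batch x accumulation x GPUs."""
    expected = (
        cfg["train_micro_batch_size_per_gpu"]
        * cfg["gradient_accumulation_steps"]
        * world_size
    )
    return cfg["train_batch_size"] == expected


cfg = json.loads(config_text)
print(check_batch_config(cfg, world_size=1))
```

With one GPU the values are consistent. Reusing the file unchanged on more GPUs would need `train_batch_size` to be scaled accordingly.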
Training ran on a single V100 (16 GB) GPU for 28 min 50 s.
Performance
We evaluated the model on the first 400 samples of the XLCost-single-prompt test split, comparing the output of each generated program with the expected output using the pass@k metric.
Metric | codegen-350M-multi-xlcost-v2 | codegen-350M-multi-xlcost | codegen-350M-mono (zero-shot) | codegen-350M-mono (one-shot) | codegen-350M-mono (few-shot) |
---|---|---|---|---|---|
pass@1 | 3.325% | 3.70% | 0.4% | 0.35% | 0.48% |
pass@10 | 15% | 14.5% | 3.5% | 3% | 3.75% |
CodeBLEU | 20.18% | None | 15.15% | 19.42% | 20.27% |
The pass@k metric estimates the probability that at least one of k generated samples passes the tests.
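This card does not state which estimator was used to compute pass@k. A common choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The sketch below implements that estimator for a single problem, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples that pass, k: budget.
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One passing sample out of ten generations
print(pass_at_k(n=10, c=1, k=1))
print(pass_at_k(n=10, c=1, k=10))
```

The benchmark score is the mean of this value over all evaluated problems.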
Citations
@article{Nijkamp2022ACP,
title={A Conversational Paradigm for Program Synthesis},
author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
journal={arXiv preprint},
year={2022}
}