RepoSim

An approach to comparing the semantic similarity of Python repositories.

Model Details

RepoSim is a pipeline that creates embeddings for specified Python repositories on GitHub. For each repository, it extracts every function's source code and docstring, encodes them into embeddings, and averages these to obtain a mean code embedding and a mean docstring embedding. These mean embeddings can be used for tasks such as comparing repositories by cosine similarity.
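The averaging and comparison steps can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and uses toy 3-dimensional vectors in place of real model output; the helper names `mean_embedding` and `cosine_similarity` are illustrative and not part of RepoSim.

```python
import math


def mean_embedding(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for per-function embeddings (assumption: real
# embeddings are much longer and come from the model).
repo_a_functions = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
repo_b_functions = [[2.0, 0.0, 1.0]]

mean_a = mean_embedding(repo_a_functions)
mean_b = mean_embedding(repo_b_functions)
print(cosine_similarity(mean_a, mean_b))
```

A value close to 1 indicates that the two mean embeddings point in nearly the same direction; a value close to 0 indicates that they are nearly orthogonal.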

Model Description

The model used by RepoSim is UniXcoder, fine-tuned on the code search task using the AdvTest dataset.

Pipeline developed by: Lazyhope
Repository: RepoSim
Model type: code understanding
Language(s): Python
License: MIT

Model Sources

Repository: UniXcoder
Paper: UniXcoder: Unified Cross-Modal Pre-training for Code Representation

Uses

Below is an example of how to use the RepoSim pipeline to easily generate embeddings for GitHub Python repositories.

First, initialise the pipeline:

from transformers import pipeline

model = pipeline(model="Lazyhope/RepoSim", trust_remote_code=True)

Then pass one repository (or several repositories in a tuple) as input. The result is a list of dictionaries, one per repository:

repo_infos = model("lazyhope/python-hello-world")
print(repo_infos)

Output (long embedding vectors are truncated):

[{'name': 'lazyhope/python-hello-world',
  'topics': [],
  'license': 'MIT',
  'stars': 0,
  'code_embeddings': [["def main():\n    print('Hello World!')",
    [-2.0755109786987305,
     2.813878297805786,
     2.352170467376709, ...]]],
  'mean_code_embedding': [-2.0755109786987305,
   2.813878297805786,
   2.352170467376709, ...],
  'doc_embeddings': [['Prints hello world',
    [-2.3749449253082275,
     0.5409570336341858,
     2.2958014011383057, ...]]],
  'mean_doc_embedding': [-2.3749449253082275,
   0.5409570336341858,
   2.2958014011383057, ...]}]
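Given output in this shape, repositories can be ranked by how similar their mean code embeddings are. The following is a minimal sketch, not part of the pipeline: it relies only on the `name` and `mean_code_embedding` keys shown above, and the repository names, the 2-dimensional vectors, and the helper `rank_by_code_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_code_similarity(query, candidates):
    """Return (name, score) pairs for candidates, most similar to query first."""
    scored = [
        (
            other["name"],
            cosine_similarity(
                query["mean_code_embedding"], other["mean_code_embedding"]
            ),
        )
        for other in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for pipeline output. In real use this list would come from
# calling the pipeline on a tuple of repositories.
repo_infos = [
    {"name": "example/query", "mean_code_embedding": [1.0, 0.0]},
    {"name": "example/near", "mean_code_embedding": [0.9, 0.1]},
    {"name": "example/far", "mean_code_embedding": [0.0, 1.0]},
]

query, *candidates = repo_infos
for name, score in rank_by_code_similarity(query, candidates):
    print(f"{name}: {score:.3f}")
```

The same ranking could be computed from `mean_doc_embedding` to compare repositories by their docstrings instead of their code.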

Training Details

Please refer to the original UniXcoder page for details of fine-tuning it on the code search task.

Evaluation

We used the awesome-python list, which contains over 500 Python repositories categorized by topic, to label similar repositories. The evaluation metrics and results can be found in the notebooks folder of the RepoSim repository.
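One plausible way to turn a topic-categorized list into similarity labels is sketched below. It assumes that two repositories count as similar exactly when they share a topic, and that each repository appears under a single topic. Both are assumptions made for illustration, and the notebooks may use a different rule. The topic and repository names are hypothetical, as is the helper `label_pairs`.

```python
from itertools import combinations


def label_pairs(categories):
    """Yield (repo_a, repo_b, is_similar) for every unordered pair of repos.

    `categories` maps a topic name to the list of repositories under it.
    A pair is labelled similar when both repositories share a topic.
    """
    topic_of = {
        repo: topic for topic, repos in categories.items() for repo in repos
    }
    for repo_a, repo_b in combinations(sorted(topic_of), 2):
        yield repo_a, repo_b, topic_of[repo_a] == topic_of[repo_b]


# Hypothetical excerpt of a topic-categorized list.
categories = {
    "Web Frameworks": ["example/web-a", "example/web-b"],
    "Testing": ["example/test-a"],
}

for repo_a, repo_b, is_similar in label_pairs(categories):
    print(repo_a, repo_b, is_similar)
```

Labels of this kind can then be compared against cosine similarity scores between mean embeddings to judge how well the embeddings separate similar from dissimilar repositories.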

Acknowledgements

Many thanks to the authors of the UniXcoder model and the AdvTest dataset, and to the maintainers of the awesome-python list for providing a useful baseline.

UniXcoder (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
AdvTest (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv)
awesome-python (https://github.com/vinta/awesome-python)

Authors

Zihao Li (https://github.com/lazyhope)
Rosa Filgueira (https://www.rosafilgueira.com)