Model Card for ARDisplay-I
The model predicts peptide presentation on the cell surface via a given HLA class I molecule. It was introduced in the paper Identification of tumor-specific MHC ligands through improved biochemical isolation and incorporation of machine learning by Shima Mecklenbräuker, Piotr Skoczylas, Paweł Biernat, Badeel Zaghla, Bartłomiej Król-Józaga, Maciej Jasiński, Victor Murcia Pienkowski, Anna Sanecka-Duin, Oliver Popp, Rafał Szatanek, Philipp Mertins, Jan Kaczmarczyk, Agnieszka Blum, and Martin G. Klatt.
Model Details
Peptide-HLA (pHLA) presentation is a major mechanism by which our immune system can recognize abnormal cells (e.g., cells altered by cancer or viral infection). ARDisplay-I predicts whether a given peptide will be displayed on the cell surface via a given HLA class I molecule. Such a presentation event enables immunosurveillance, and if the antigen is recognized as non-self, it can trigger an immune response.
pHLA presentation is itself a complex multi-stage process: the antigen is processed, attached to a particular HLA molecule, and the whole pHLA complex is then transported to the cell surface. Within each human cell, proteins are constantly degraded into short peptides or amino acids. During this process, some protein fragments, typically 8-11 amino acids long, may bind to a specific HLA molecule and subsequently be transported to the cell surface. The predictions from our model encompass the entire processing and presentation pathway.
Please note that in most application scenarios, the model requires additional post-processing steps and appropriate filtering. Moreover, if your data is not standard (i.e., it contains neoepitopes, peptides originating from alternative splicing, virus epitopes, dark antigens, etc.), you might need additional domain knowledge and/or a model fine-tuned to your needs. If necessary, feel free to contact us for support.
The model was developed at Ardigen as part of the Immunology platform. Free access to the regular version is available via the Hugging Face platform for non-commercial academic use only (see License). For commercial use and the Pro model versions, we encourage you to contact us at ardisplay@ardigen.com.
We invite you to take a look at the full offer.
Model Description
- Developed by: Ardigen S.A. - AI in Drug Discovery
- Model type: Protein Language Model
- License: Other
Model Sources
- Demo: https://huggingface.co/spaces/ardigen/ardisplay-i
Uses
The model takes peptide-HLA (pHLA) pairs as input and returns a presentation score in the range [0, 1]. It can be used to select peptides with the highest probability of being presented by specific HLA molecules, find protein fragments with a high presentation probability, find multiple HLAs presenting a given peptide, or scan an entire protein for presented subsequences.
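As an illustration of the protein-scanning use case, the sketch below enumerates every 8-11-residue window of a protein and pairs it with one HLA allele, producing strings in the "HLA,peptide" form used in the getting-started example. The helper name `scan_protein` and the example sequence are hypothetical and are not part of the model's API; the resulting list would still need to be passed to the model for scoring.

```python
def scan_protein(sequence, hla, min_len=8, max_len=11):
    """Return 'HLA,peptide' strings for every window of min_len..max_len residues."""
    pairs = []
    for length in range(min_len, max_len + 1):
        # Slide a window of the current length along the protein.
        for start in range(len(sequence) - length + 1):
            pairs.append(f"{hla},{sequence[start:start + length]}")
    return pairs


# Hypothetical 12-residue sequence, used for illustration only.
pairs = scan_protein("MKTAYIAKQRQI", "A01:02")
print(len(pairs), pairs[0])
```

Enumerating windows up front keeps the scan independent of the scoring step, so the same list can be reused for several HLA alleles.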
Limitations
- Supports a pre-defined set of HLAs.
- Does not work on peptides containing ambiguous amino acids, such as X or J.
- Expects short peptides as input, between 8 and 11 amino acids long.
- Replaces selenocysteine (U) with cysteine (C) before running inference.
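The peptide-related limitations above can be checked before inference with a small pre-filter. The following is a minimal sketch, not the model's own validation code: the helper `prepare_peptide` is hypothetical, and restricting input to the 20 standard amino acids is an assumption (the card only names ambiguous codes such as X and J as unsupported). The set of supported HLAs is not checked here.

```python
# Assumption: only the 20 standard amino acids are accepted after U -> C replacement.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_peptide(peptide, min_len=8, max_len=11):
    """Normalize a peptide; return None if it violates the listed limitations."""
    # Mirror the documented behavior: selenocysteine (U) is replaced with cysteine (C).
    peptide = peptide.strip().upper().replace("U", "C")
    if not min_len <= len(peptide) <= max_len:
        return None
    if not set(peptide) <= STANDARD_AMINO_ACIDS:
        return None
    return peptide


candidates = ["AAAAAAAA", "AAAUAAAA", "AAAXAAAA", "AAAAAAA"]
valid = [p for p in map(prepare_peptide, candidates) if p is not None]
print(valid)
```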
Metrics
Ardigen's ARDisplay-I with over 2 times higher Average Precision
Our model enables the prediction of HLA-I presented peptides with over 2 times higher Average Precision than current state-of-the-art tools (NetMHCpan and MHCflurry).
The study cohort includes the multiple myeloma cell lines JJN3 and LP-1 as well as the lymphoblastic leukemia cell line Nalm-6. The data consist of mass spectrometry (MS) results generated by Dr. Philipp Mertins, Martin Klatt, M.D., et al. and describe more than 32,000 HLA ligands presented on the cell surface of the three cell lines, which together express 17 distinct HLA class I alleles.
<div style="text-align:center"> <img src="https://huggingface.co/ardigen/ardisplay-i/resolve/main/documentation_images/benchmark_PR_curves.png" alt="Comparison of precision-recall (PR) curves" width="500"/> </div>
Comparison of precision-recall (PR) curves.
Our model achieves higher precision than the other methods at every point of the PR curve. The standard-deviation bands do not overlap, which indicates a statistically significant performance difference between the methods.
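For readers unfamiliar with the metric, Average Precision summarizes a PR curve as the mean of the precision values at the ranks where a presented pHLA pair is found. The sketch below shows one common way to compute it from binary labels and model scores; it is a generic illustration with made-up numbers, and the exact procedure used in the benchmark may differ.

```python
def average_precision(labels, scores):
    """Mean of precision values taken at the rank of each positive example."""
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Illustrative labels (1 = presented) and scores, not benchmark data.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 3))
```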
<div style="text-align:center"> <img src="https://huggingface.co/ardigen/ardisplay-i/resolve/main/documentation_images/benchmark_PPVs.png" alt="Positive predictive values (PPV)" width="500"/> </div>
Positive predictive values (PPV) with four selected thresholds, i.e., top-10, 20, 50, & 100 pHLA pairs selected by each method. For example, PPV (top 10) is the expected fraction of presented pHLA pairs among the top 10 pHLAs ranked by the respective model.
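The top-k PPV described above can be sketched as follows: rank all pHLA pairs by score, keep the k highest, and report the fraction that are truly presented. The function name `ppv_at_k` and the numbers are illustrative assumptions, not the benchmark code or data.

```python
def ppv_at_k(labels, scores, k):
    """Fraction of presented pairs (label 1) among the k highest-scoring pairs."""
    if k <= 0 or not labels:
        raise ValueError("k must be positive and labels must be non-empty")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Illustrative labels (1 = presented) and scores.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
print(ppv_at_k(labels, scores, 2), ppv_at_k(labels, scores, 4))
```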
Find out more about Identifying therapeutic targets.
How to Get Started with the Model
You can visit our interactive demo and try the model there. Alternatively, you can run the model on your machine from Python or as a CLI tool by following the sections below.
Hugging Face
Install the dependencies
pip install -U transformers==4.30.1 torch==1.13.1 tape_proteins==0.5 mhcflurry==2.0.4 mhcgnomes==1.7.0
and the auxiliary MHCflurry model for binding affinity prediction
mhcflurry-downloads fetch --release 1.7.0 models_class1_pan
Use the code below to get started with the model.
from transformers import pipeline
pipe = pipeline(model="ardigen/ardisplay-i", trust_remote_code=True)
data = ["A01:02,AAAAAAAA", "A01:02,CCCCCCCCCC"]
result = pipe(data)
print(result)
The peptides passed to the model must be between 8 and 11 amino acids long and cannot contain ambiguous amino acid descriptors such as X, B, Z, or J.
CLI
You can also install the model as a CLI tool for use in bioinformatics pipelines with the following command (assuming you have python3 and pip installed):
wget https://huggingface.co/ardigen/ardisplay-i/raw/main/cli/install.sh -O - | bash
This will install the ardisplay-i-cli tool, which takes a text file with a list of HLA,peptide pairs and outputs a .csv file. See ardisplay-i-cli --help for the details.
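An input file for the CLI could be prepared from Python along the lines of the sketch below. It assumes one HLA,peptide pair per line, which the text above does not state explicitly, and the helper `write_pairs` is hypothetical; check ardisplay-i-cli --help for the actual input format and options before relying on it.

```python
import os
import tempfile
from pathlib import Path


def write_pairs(pairs, path):
    """Write (hla, peptide) tuples as 'HLA,peptide' lines; return the line count."""
    # Assumption: the CLI expects one pair per line.
    lines = [f"{hla},{peptide}" for hla, peptide in pairs]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Write the two pairs from the getting-started example to a temporary file.
input_path = os.path.join(tempfile.mkdtemp(), "pairs.txt")
count = write_pairs([("A01:02", "AAAAAAAA"), ("A01:02", "CCCCCCCCCC")], input_path)
print(count, input_path)
```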
Training Details
The details of model training are proprietary.