Model Card for ARDisplay-I

The model predicts peptide presentation on the cell surface via a given HLA class I molecule. It was introduced in the paper "Identification of tumor-specific MHC ligands through improved biochemical isolation and incorporation of machine learning" by Shima Mecklenbräuker, Piotr Skoczylas, Paweł Biernat, Badeel Zaghla, Bartłomiej Król-Józaga, Maciej Jasiński, Victor Murcia Pienkowski, Anna Sanecka-Duin, Oliver Popp, Rafał Szatanek, Philipp Mertins, Jan Kaczmarczyk, Agnieszka Blum, and Martin G. Klatt.

Model Details

Peptide-HLA (pHLA) presentation is a major mechanism by which the immune system recognizes abnormal cells (e.g., cells altered by cancer or viral infection). ARDisplay-I predicts whether a given peptide will be displayed on the cell surface via a given HLA class I molecule. Such a presentation event enables immunosurveillance, and if the antigen is recognized as non-self, it can trigger an immune response.

pHLA presentation is a complex multi-stage process: the antigen is processed, attached to a particular HLA molecule, and the whole pHLA complex is transported to the cell surface. Within each human cell, proteins are constantly degraded into short peptides or amino acids. During this process, some protein fragments, typically 8-11 amino acids long, may bind to a specific HLA molecule and subsequently be transported to the cell surface. The predictions from our model encompass the entire processing and presentation pathway.

Please note that in most application scenarios, the model requires additional post-processing steps and appropriate filtering. Moreover, if your data is not standard (e.g., it contains neoepitopes, peptides originating from alternative splicing, virus epitopes, dark antigens, etc.), you might need additional domain knowledge and/or a model fine-tuned to your needs. If necessary, feel free to contact us for support.

The model was developed at Ardigen as part of the Immunology platform. Free access to the regular version is available via the Hugging Face platform for non-commercial academic use only (see License). For commercial use and the Pro model versions, we encourage you to contact us at ardisplay@ardigen.com.

We invite you to take a look at the full offer.

Model Description

Developed by: Ardigen S.A. - AI in Drug Discovery
Model type: Protein Language Model
License: Other

Model Sources

Demo: https://huggingface.co/spaces/ardigen/ardisplay-i

Uses

The model takes peptide-HLA (pHLA) pairs as input and returns a presentation score in the range [0, 1]. It can be used to select peptides with the highest probability of being presented by specific HLA molecules, find protein fragments with a high presentation probability, find multiple HLAs presenting a given peptide, or scan an entire protein for presented subsequences.
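
Scanning an entire protein for presented subsequences amounts to enumerating every window of 8 to 11 residues and pairing each window with an HLA. The sketch below shows one way to build such inputs. The helper names and the example sequence are hypothetical, and the "HLA,peptide" string format is assumed from the Python example in this card.

```python
from typing import Iterator, List

MIN_LEN, MAX_LEN = 8, 11  # peptide lengths supported by the model


def iter_kmers(protein: str, min_len: int = MIN_LEN, max_len: int = MAX_LEN) -> Iterator[str]:
    """Yield every contiguous subsequence of length min_len..max_len."""
    for k in range(min_len, max_len + 1):
        for start in range(len(protein) - k + 1):
            yield protein[start:start + k]


def build_inputs(hla: str, protein: str) -> List[str]:
    """Format unique k-mers as "HLA,peptide" strings, keeping first-seen order."""
    unique = dict.fromkeys(iter_kmers(protein))  # dict preserves insertion order
    return [f"{hla},{peptide}" for peptide in unique]


# Hypothetical 12-residue sequence, used only for illustration.
inputs = build_inputs("A01:02", "MKTAYIAKQRQI")
print(len(inputs))  # 5 + 4 + 3 + 2 windows of length 8, 9, 10, 11
print(inputs[0])
```

The resulting list has the same shape as the `data` list in the Python example, so it could be scored in one call. Deduplicating before scoring avoids repeated inference on proteins with repeated segments.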

Limitations

Supports a pre-defined set of HLAs.
Does not work on peptides containing ambiguous amino acids, such as X or J.
Assumes short peptides as input, limited to between 8 and 11 amino acids.
Replaces selenocysteine (U) with cysteine (C) before running inference.
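
The peptide-level limitations above can be checked before inference. The following is a minimal sketch, not the model's own validation code: `normalize_peptide` is a hypothetical helper, and treating every letter outside the 20 standard amino acids (after the U to C replacement) as unsupported is an assumption. The set of supported HLAs is not listed in this card, so it is not checked here.

```python
from typing import Optional

# Assumption: only the 20 standard amino acids are accepted after U -> C.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_peptide(peptide: str) -> Optional[str]:
    """Return a cleaned peptide, or None if it violates a listed limitation."""
    # Mirror the documented replacement of selenocysteine with cysteine.
    cleaned = peptide.strip().upper().replace("U", "C")
    if not 8 <= len(cleaned) <= 11:
        return None  # outside the supported length range
    if not set(cleaned) <= STANDARD_AMINO_ACIDS:
        return None  # contains an ambiguous or unknown residue such as X or J
    return cleaned


candidates = ["AAAAAAAA", "AAAAUAAA", "AAAAXAAA", "AAAAAAA"]
valid = [p for p in map(normalize_peptide, candidates) if p is not None]
print(valid)
```

Returning `None` rather than raising keeps the helper easy to use as a filter over large peptide lists.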

Metrics

Ardigen's ARDisplay-I with over 2 times higher Average Precision

Our model predicts HLA-I presented peptides with over 2 times higher Average Precision than the current state-of-the-art solutions (NetMHCpan and MHCflurry).

The study cohort includes the multiple myeloma cell lines JJN3 and LP-1 as well as the lymphoblastic leukemia cell line Nalm-6. The data consist of mass spectrometry (MS) results generated by Dr. Philipp Mertins, Martin Klatt, M.D., et al. and describe more than 32,000 HLA ligands, each presented on the cell surface of one of the three cell lines, which together express 17 distinct HLA class I alleles.

Comparison of precision-recall (PR) curves.

Our model achieves higher precision at every point of the PR curve. The standard deviation bands of the methods do not overlap, which indicates that the performance difference between the methods is statistically significant.
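
For readers who want to compute Average Precision on their own labeled data, the sketch below uses the common non-interpolated definition: the mean of the precision values at the ranks where a presented pHLA pair occurs. This card does not state which exact definition was used for the reported numbers, so the function and the toy data are illustrative assumptions only.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive."""
    # Rank items from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Toy data: 1 = presented, 0 = not presented (made up for illustration).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)
print(round(ap, 4))
```

In the toy data the positives sit at ranks 1 and 3, so the result is the mean of 1/1 and 2/3.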

Positive predictive values (PPV) at four selected thresholds, i.e., the top 10, 20, 50, and 100 pHLA pairs selected by each method. For example, PPV (top 10) is the expected fraction of presented pHLA pairs among the top 10 pHLAs ranked by the respective model.
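
The top-k PPV described above can be computed from scores and ground-truth labels as follows. This is a minimal sketch with a hypothetical helper name and made-up toy data, not the evaluation code used for the reported results.

```python
from typing import Sequence


def ppv_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of presented pairs among the k highest-scoring pHLA pairs."""
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must be non-empty")
    # Indices of the k best-scoring pairs (fewer if there are fewer than k pairs).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / len(top)


# Toy data: 1 = presented, 0 = not presented (made up for illustration).
toy_scores = [0.95, 0.90, 0.40, 0.80, 0.10]
toy_labels = [1, 0, 1, 1, 0]
ppv_top3 = ppv_at_k(toy_scores, toy_labels, k=3)
print(round(ppv_top3, 4))
```

Here the three best-scoring pairs carry the labels 1, 0, and 1, so two of the three are presented.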

Find out more about Identifying therapeutic targets.

How to Get Started with the Model

You can visit our interactive demo and try the model there. Alternatively, you can run the model on your machine from Python or as a CLI tool by following the sections below.

Hugging Face

Install the dependencies

pip install -U transformers==4.30.1 torch==1.13.1 tape_proteins==0.5 mhcflurry==2.0.4 mhcgnomes==1.7.0

and the auxiliary MHCflurry model for binding affinity prediction

mhcflurry-downloads fetch --release 1.7.0 models_class1_pan

Use the code below to get started with the model.

from transformers import pipeline

pipe = pipeline(model="ardigen/ardisplay-i", trust_remote_code=True)
data = ["A01:02,AAAAAAAA", "A01:02,CCCCCCCCCC"]
result = pipe(data)
print(result)

The peptides passed to the model must be between 8 and 11 amino acids long and cannot contain ambiguous amino acid descriptors, such as X, B, Z, J, etc.

CLI

You can also install the model as a CLI tool for use in bioinformatics pipelines with the following command (assuming you have python3 and pip installed):

wget https://huggingface.co/ardigen/ardisplay-i/raw/main/cli/install.sh -O - | bash

This installs the ardisplay-i-cli tool, which takes a text file with a list of HLA,peptide pairs and outputs a .csv file. See ardisplay-i-cli --help for details.
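
An input file for the CLI could be prepared from Python along the following lines. The layout of one HLA,peptide pair per line is an assumption based on the description above, and the helper and file name are hypothetical; check ardisplay-i-cli --help for the exact format and the invocation.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_pairs(path: Path, pairs: Iterable[Tuple[str, str]]) -> int:
    """Write one "HLA,peptide" pair per line; return the number of lines."""
    # Assumption: the CLI expects one comma-separated pair per line.
    lines = [f"{hla},{peptide}" for hla, peptide in pairs]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Pairs taken from the Python example in this card.
pairs = [("A01:02", "AAAAAAAA"), ("A01:02", "CCCCCCCCCC")]
with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "pairs.txt"  # hypothetical file name
    written = write_pairs(input_path, pairs)
    content = input_path.read_text(encoding="utf-8")
print(written)
print(content, end="")
```

A temporary directory is used here only to keep the example self-contained; in a pipeline the file would be written to a persistent location.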

Training Details

The details of model training are proprietary.