<!--

Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Redistribution and use in source and binary forms, with or without

modification, are permitted provided that the following conditions

are met:

* Redistributions of source code must retain the above copyright

notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright

notice, this list of conditions and the following disclaimer in the

documentation and/or other materials provided with the distribution.

* Neither the name of NVIDIA CORPORATION nor the names of its

contributors may be used to endorse or promote products derived

from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY

EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR

PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR

CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,

EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,

PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR

PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY

OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT

(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE

OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

-->

Deploy models using Triton


Any deep learning inference serving solution needs to tackle two fundamental challenges:

Managing multiple models.
Versioning, loading, and unloading models.

Before we begin

This conceptual guide aims to educate developers about the challenges faced when building inference infrastructure for deploying deep learning pipelines. Parts 1 through 5 of this guide build towards solving a simple problem: deploying a performant and scalable pipeline for transcribing text from images. This pipeline includes 5 steps:

Pre-process the raw image
Detect which parts of the image contain text (Text Detection Model)
Crop image to regions with text
Find text probabilities (Text Recognition Model)
Convert probabilities to actual text

In Part 1, we start by deploying both models on Triton with the pre/post processing steps done on the client.

Deploying multiple models

The key challenge in managing multiple models is building infrastructure that can cater to the different requirements of each model. For instance, users may need to deploy a PyTorch model and a TensorFlow model on the same server, where the two models see different loads, need to run on different hardware devices, and need independently managed serving configurations (model queues, versions, caching, acceleration, and more). The Triton Inference Server caters to all of the above and more.


The first step in deploying models using the Triton Inference Server is building a repository that houses the models which will be served and the configuration schema. For the purposes of this demonstration, we will be making use of an EAST model to detect text and a text recognition model. This workflow is largely an adaptation of OpenCV's Text Detection samples.

To begin, let's clone the repository and navigate to this folder.

cd Conceptual_Guide/Part_1-model_deployment

Next, we'll download the necessary models and make sure they are in a format that Triton can deploy.

Model 1: Text Detection

Download and unzip OpenCV's EAST model.

wget https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz
tar -xvf frozen_east_text_detection.tar.gz

Export to ONNX.

Note: The following step requires you to have the TensorFlow library installed. We recommend executing the following step within the NGC TensorFlow container environment, which you can launch with docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/tensorflow:<yy.mm>-tf2-py3

pip install -U tf2onnx
python -m tf2onnx.convert --input frozen_east_text_detection.pb --inputs "input_images:0" --outputs "feature_fusion/Conv_7/Sigmoid:0","feature_fusion/concat_3:0" --output detection.onnx

Model 2: Text Recognition

Download the Text Recognition model weights.

wget https://www.dropbox.com/sh/j3xmli4di1zuv3s/AABzCC1KGbIRe2wRwa3diWKwa/None-ResNet-None-CTC.pth

Export the model to ONNX using the model definition file in the utils folder. This file is adapted from Baek et al. 2019.

Note: The following python script requires you to have the PyTorch library installed. We recommend executing the following step within the NGC PyTorch container environment, which you can launch with docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:<yy.mm>-py3

import torch
from utils.model import STRModel

# Create PyTorch Model Object
model = STRModel(input_channels=1, output_channels=512, num_classes=37)

# Load model weights from external file
state = torch.load("None-ResNet-None-CTC.pth")
state = {key.replace("module.", ""): value for key, value in state.items()}
model.load_state_dict(state)

# Create ONNX file by tracing model
trace_input = torch.randn(1, 1, 32, 100)
torch.onnx.export(model, trace_input, "str.onnx", verbose=True)

Setting up the model repository

A model repository is how Triton reads your models and the metadata associated with each model (configurations, version files, etc.). A model repository can live in a local or network-attached filesystem, or in a cloud object store such as AWS S3, Azure Blob Storage, or Google Cloud Storage. For more details on model repository location, refer to the documentation. A server can also use multiple model repositories. For simplicity, this explanation uses a single repository stored in the local filesystem, in the following format:

# Example repository structure
<model-repository>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  ...

There are three important components to be discussed from the above structure:

model-name: The identifying name for the model.
config.pbtxt: For each model, users can define a model configuration. This configuration, at minimum, needs to define: the backend, name, shape, and datatype of model inputs and outputs. For most of the popular backends, this configuration file is autogenerated with defaults. The full specification of the configuration file can be found in the model_config protobuf definition.
version: Versioning makes multiple versions of the same model available for use, depending on the policy selected. More information is available in the versioning documentation.
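To make the layout above concrete, here is a minimal sketch that walks a local repository and reports the version directories found for each model. The helper name `list_models` is hypothetical and not part of Triton; the sketch assumes that a version is a subdirectory with a purely numeric name that does not start with zero, and it does not inspect config.pbtxt or the model files themselves.

```python
from pathlib import Path


def list_models(repository: Path) -> dict[str, list[int]]:
    """Map each model directory to its sorted numeric version subdirectories."""
    models = {}
    for model_dir in sorted(p for p in repository.iterdir() if p.is_dir()):
        versions = sorted(
            int(v.name)
            for v in model_dir.iterdir()
            # Assumption: only numerically named directories that do not
            # start with zero count as versions; everything else is skipped.
            if v.is_dir()
            and v.name.isascii()
            and v.name.isdigit()
            and not v.name.startswith("0")
        )
        models[model_dir.name] = versions
    return models
```

Calling `list_models(Path("model_repository"))` on a repository built as described in this guide would be a quick way to spot a model directory that has no version folder before launching the server.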

For this example you can set up the model repository structure in the following manner:

mkdir -p model_repository/text_detection/1
mv detection.onnx model_repository/text_detection/1/model.onnx

mkdir -p model_repository/text_recognition/1
mv str.onnx model_repository/text_recognition/1/model.onnx

These commands should give you a repository that looks like this:

# Expected folder layout
model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

Note that, for this example, we've already created the config.pbtxt files and placed them in the necessary location. In the next section, we'll discuss the contents of these files.

Model configuration

With the models and the file structure ready, the next things we need to look at are the config.pbtxt model configuration files. Let's first look at the model configuration for the EAST text detection model that's been provided for you at /model_repository/text_detection/config.pbtxt. This shows that text_detection is an ONNX model that has one input and two output tensors.

name: "text_detection"
backend: "onnxruntime"
max_batch_size : 256
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 1 ]
  }
]
output [
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 5 ]
  }
]

name: "name" is an optional field, the value of which should match the name of the directory of the model.
backend: This field indicates which backend is used to run the model. Triton supports a wide variety of backends such as TensorFlow, PyTorch, Python, ONNX, and more. For the complete list of valid values for this field, refer to the comments in the model config protobuf definition.
max_batch_size: As the name implies, this field defines the maximum batch size that the model can support.
input and output: The input and output sections specify the name, shape, datatype, and more, while providing operations like reshaping and support for ragged batches.

In most cases, it's possible to leave out the input and output sections and let Triton extract that information from the model files directly. Here, we've included them for clarity and because we'll need to know the names of our output tensors in the client application later on.

For details of all supported fields and their values, refer to the model config protobuf definition file.
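The -1 entries in the dims fields above mark dimensions whose size can vary from request to request. As a rough illustration of that idea, the sketch below checks a concrete tensor shape against a dims list, treating -1 as a wildcard. The function `shape_matches` is hypothetical; it is not Triton's validation logic and it deliberately ignores how max_batch_size interacts with the batch dimension.

```python
def shape_matches(dims: list[int], shape: tuple[int, ...]) -> bool:
    """Return True if a concrete tensor shape is compatible with config dims.

    A dim of -1 is treated as a wildcard that accepts any size.
    """
    if len(dims) != len(shape):
        return False
    return all(d == -1 or d == s for d, s in zip(dims, shape))
```

Under these assumptions, a 480x640 RGB image with a leading batch axis, shape `(1, 480, 640, 3)`, is compatible with `[-1, -1, -1, 3]`, while a single-channel image of shape `(1, 480, 640, 1)` is not.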

Launching the server

With our repository created and our models configured, we're ready to launch the server. While the Triton Inference Server can be built from source, the use of pre-built Docker containers freely available from NGC is highly recommended for this example.

# Replace the yy.mm in the image name with the release year and month
# of the Triton version needed, e.g. 22.08

docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:<yy.mm>-py3

Once Triton Inference Server has been built or once inside the container, it can be launched with the command:

tritonserver --model-repository=/models

This spins up the server. Once the models have loaded, the model instances are ready for inference.

I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| text_detection   | 1       | READY  |
| text_recognition | 1       | READY  |
+------------------+---------+--------+

I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.23.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0712 16:37:18.269464 128 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0712 16:37:18.269956 128 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0712 16:37:18.311686 128 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
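Before sending inference requests, it can be handy to confirm from a script that the server is up. The sketch below polls the readiness endpoint of the HTTP service using only the Python standard library. The helper name `is_server_ready` and the default URL are assumptions for this example; the default URL matches the port mapping used in the docker command above.

```python
import urllib.error
import urllib.request


def is_server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status:
        # treat all of these as "not ready".
        return False
```

A launch script could call `is_server_ready()` in a loop with a short sleep and only start the client once it returns True.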

Building a client application

Now that our Triton server has been launched, we can start sending messages to it. There are three ways to interact with the Triton Inference Server:

HTTP(S) API
gRPC API
Native C API

There are also pre-built client libraries in C++, Python, and Java that wrap over the HTTP and gRPC APIs. This example contains a Python client script in client.py which uses the tritonclient python library to communicate with Triton over the HTTP API.

Let's examine the contents of this file:

First, we import our HTTP client from the tritonclient library, as well as a few other libraries we'll use for processing our images:
```
import math
import numpy as np
import cv2
import tritonclient.http as httpclient
```

Next, we'll define a few helper functions that take care of the pre- and post-processing steps for our pipeline. The details are omitted here for brevity, but you can find them in the client.py file.

def detection_preprocessing(image: cv2.Mat) -> np.ndarray:
  ...

def detection_postprocessing(scores: np.ndarray, geometry: np.ndarray, preprocessed_image: np.ndarray) -> np.ndarray:
  ...

def recognition_postprocessing(scores: np.ndarray) -> str:
  ...
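To give a feel for the last pipeline step, converting probabilities to actual text, here is a minimal sketch of greedy CTC decoding. It is not the implementation in client.py and may differ from it. It assumes that class 0 is the CTC blank and that the remaining 36 classes map to digits followed by lowercase letters, which is consistent with num_classes=37 but is still an assumption. The name `greedy_ctc_decode` is hypothetical.

```python
# Assumed class order: index 0 is the CTC blank, indices 1..36 map to ALPHABET.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def greedy_ctc_decode(scores) -> str:
    """Decode a (timesteps, num_classes) score matrix into text.

    Takes the best class per timestep, collapses consecutive repeats,
    then drops the blank class (index 0).
    """
    best = [max(range(len(row)), key=lambda i: row[i]) for row in scores]
    chars = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not blank.
        if index != previous and index != 0:
            chars.append(ALPHABET[index - 1])
        previous = index
    return "".join(chars)
```

The function accepts any sequence of per-timestep score rows, so a single item from the recognition output (with the batch axis removed) can be passed in directly. A blank between two identical classes is what allows doubled letters to survive the collapsing step.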

Then, we create a client object, and initialize a connection with the Triton Inference Server.
```
client = httpclient.InferenceServerClient(url="localhost:8000")
```

Now, we'll create the InferInput that we'll be sending to Triton from our data.

raw_image = cv2.imread("./img2.jpg")
preprocessed_image = detection_preprocessing(raw_image)

detection_input = httpclient.InferInput("input_images:0", preprocessed_image.shape, datatype="FP32")
detection_input.set_data_from_numpy(preprocessed_image, binary_data=True)

Finally, we're ready to send an inference request to the Triton Inference Server and retrieve the response:
```
detection_response = client.infer(model_name="text_detection", inputs=[detection_input])
```
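Under the hood, the HTTP client turns this call into a POST request to the model's infer endpoint (here, /v2/models/text_detection/infer). The sketch below builds the plain-JSON form of such a request body to show roughly what travels over the wire. It is an approximation for illustration: `build_infer_request` is a hypothetical helper, the tiny tensor is made up, and because the example above uses binary_data=True, the client library would send the tensor bytes in binary form rather than in a JSON "data" field.

```python
import json


def build_infer_request(name: str, shape: list[int], datatype: str, data: list) -> str:
    """Serialize a single-input inference request as a JSON body."""
    body = {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }
    return json.dumps(body)


# Hypothetical 1x1x1x3 input, flattened in row-major order.
payload = build_infer_request("input_images:0", [1, 1, 1, 3], "FP32", [0.0, 0.5, 1.0])
```

Seeing the request in this form makes it clearer why the input name, shape, and datatype passed to InferInput must agree with the model configuration.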

After that, we'll repeat the process with the text recognition model, performing our next processing step, creating the input object, querying the server and finally performing postprocessing and printing the result.

# Process responses from detection model
scores = detection_response.as_numpy('feature_fusion/Conv_7/Sigmoid:0')
geometry = detection_response.as_numpy('feature_fusion/concat_3:0')
cropped_images = detection_postprocessing(scores, geometry, preprocessed_image)

# Create input object for recognition model
recognition_input = httpclient.InferInput("input.1", cropped_images.shape, datatype="FP32")
recognition_input.set_data_from_numpy(cropped_images, binary_data=True)

# Query the server
recognition_response = client.infer(model_name="text_recognition", inputs=[recognition_input])

# Process response from recognition model
text = recognition_postprocessing(recognition_response.as_numpy('308'))

print(text)

Let's try it out!

pip install tritonclient[http] opencv-python-headless
python client.py

You might have noticed that it's a bit redundant to retrieve the results of the first model only to do some processing and send them right back to Triton. In Part 5 of this tutorial we explore how you can move more processing steps to the server and execute multiple models in a single network call.

Model Versioning

The ability to deploy different versions of a model is essential to building an MLOps pipeline. The need arises from use cases such as A/B testing and easy rollbacks to an earlier model version. To add a new version, Triton users can create a new version folder containing the new model under the same model directory in the repository:

model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

By default, Triton serves the "latest" version of a model, but the policy for serving different versions is customizable. For more information, refer to this guide.
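As a small illustration of what "latest" means here, the sketch below picks the numerically greatest version from a list of directory names. The function `latest_version` is hypothetical and only mimics the default behavior under the assumption that versions are compared as integers rather than as strings.

```python
from typing import Iterable, Optional


def latest_version(version_dirs: Iterable[str]) -> Optional[int]:
    """Return the numerically greatest version, or None if there is none."""
    numeric = [int(name) for name in version_dirs if name.isascii() and name.isdigit()]
    return max(numeric, default=None)
```

Note that the comparison is numeric, so version 10 is newer than version 2. On the client side, the `infer` method of the tritonclient library also accepts a `model_version` argument for requesting a specific version, provided the version policy makes that version available.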

Loading & Unloading Models

Triton has a model management API that can be used to control the model loading and unloading policies. This API is extremely useful in cases where one or more models need to be loaded or unloaded without interrupting inference for other models being served on the same server. Users can select from one of three control modes:

NONE
EXPLICIT
POLL

The control mode is set via a command line argument when launching the server, for example:

tritonserver --model-repository=/models --model-control-mode=poll

For more information, refer to this section of the documentation.
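When the server runs in EXPLICIT mode, models are loaded and unloaded through requests to the model repository API. The sketch below shows one way to issue such a request over HTTP with the standard library. The helper names `model_control_url` and `control_model` are hypothetical, and the empty JSON body is an assumption that should be checked against the model management documentation. The tritonclient library offers `load_model` and `unload_model` methods that wrap the same API.

```python
import urllib.request


def model_control_url(base_url: str, model_name: str, action: str) -> str:
    """Build the repository API URL for loading or unloading a model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    return f"{base_url}/v2/repository/models/{model_name}/{action}"


def control_model(base_url: str, model_name: str, action: str) -> int:
    """POST the load or unload request and return the HTTP status code."""
    request = urllib.request.Request(
        model_control_url(base_url, model_name, action),
        data=b"{}",  # assumption: an empty JSON object is an acceptable body
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `control_model("http://localhost:8000", "text_detection", "unload")` would ask a server started with --model-control-mode=explicit to unload the detection model while text_recognition keeps serving requests.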

What's next?

In this tutorial, we covered the very basics of setting up and querying a Triton Inference Server. This is Part 1 of a 6-part tutorial series that covers the challenges faced in deploying deep learning models to production. Part 2 covers Concurrent Model Execution and Dynamic Batching. Depending on your workload and experience, you might want to jump to Part 5, which covers Building an Ensemble Pipeline with multiple models, pre- and post-processing steps, and adding business logic.