Custom Model

This document lists the steps for adding custom models to the Nimble framework. Following them exposes your model to Nimble, so that you can deploy it and subscribe to its metadata via the REST and WebSocket APIs. Models are added in two steps:

Their binary files are added to the models directory.
Their pre- and post-processing functions are provided in a Python file.

First we will look at adding the binary files to the models directory, the structure of which is displayed below:

models
├── CPU
├── GPU
...
├── YoloV5Face.py
└── YoloV5.py

In addition to the hardware target directories, the models directory holds the pre- and post-processing Python files, each associated with one or more models.

Within each hardware target directory there is one directory per model. Each of these directories stores the binary files needed to load the network. Currently, cpu and igpu models share the same directory (models/CPU). For example, models/CPU looks something like this:

models/CPU/
├── arcface
├── coco-large
...
├── yolov5s
└── yolov5s6

Adding a `cpu` or `igpu` Model

For the cpu and igpu hardware targets, the currently supported deep learning framework is OpenVINO. Layouts differ slightly between frameworks, so please refer to the relevant section below. Common to all of them is models/CPU/<name_of_model>/labels.txt, a newline-delimited file with the class names (only needed if the model requires it).
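
As an illustration of the labels.txt format, the sketch below reads a newline-delimited file into a list. It is not Nimble's own loader; `load_labels` is a hypothetical helper, and the assumption that a line's position corresponds to the model's class index is ours.

```python
import tempfile
from pathlib import Path


def load_labels(labels_path):
    """Read a newline-delimited labels file into a list of class names."""
    text = Path(labels_path).read_text(encoding="utf-8")
    # Strip whitespace and skip blank lines (for example a trailing newline).
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo on a throwaway file with three example class names.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "labels.txt"
    path.write_text("person\nbicycle\ncar\n", encoding="utf-8")
    labels = load_labels(path)

print(labels)  # ['person', 'bicycle', 'car']
```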

OpenVINO

OpenVINO is the recommended deep learning framework because it achieves high performance; however, restrictions in the OpenVINO Model Optimiser can make it difficult to create the binary files. To add an OpenVINO model to Nimble you need the .xml and .bin files created by the OpenVINO Model Optimiser. We recommend using DL Workbench to convert your models.

Once you have the binary files, create the directory models/CPU/<name_of_model> and place the .xml and .bin files in a nested FP32 directory. Your directory structure should look like this:

models/CPU/<name_of_model>
├── FP32
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
└── labels.txt

If you plan to use the igpu, you can also provide an FP16 model like this:

models/CPU/<name_of_model>
├── FP32
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
├── FP16
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
└── labels.txt

If the FP16 model isn't present, the igpu falls back to FP32.
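
The layout and the FP16 fallback can be sketched as a small check you might run before deploying. This is only an illustration of the rule described above, not Nimble's loader; `find_openvino_files` and `resolve_precision` are hypothetical helpers, and `my_model` is a made-up model name.

```python
import tempfile
from pathlib import Path


def find_openvino_files(model_dir, precision):
    """Return (xml, bin) paths for a precision, or None if either file is missing."""
    model_dir = Path(model_dir)
    name = model_dir.name  # files are named after the model directory
    xml_path = model_dir / precision / f"{name}.xml"
    bin_path = model_dir / precision / f"{name}.bin"
    if xml_path.is_file() and bin_path.is_file():
        return xml_path, bin_path
    return None


def resolve_precision(model_dir, device):
    """Pick a precision directory: igpu prefers FP16 and falls back to FP32."""
    candidates = ["FP16", "FP32"] if device == "igpu" else ["FP32"]
    for precision in candidates:
        if find_openvino_files(model_dir, precision) is not None:
            return precision
    raise FileNotFoundError(f"no usable .xml/.bin pair under {model_dir}")


def add_precision(model_dir, precision):
    """Create empty placeholder .xml/.bin files for the demo."""
    target = model_dir / precision
    target.mkdir(parents=True)
    (target / f"{model_dir.name}.xml").touch()
    (target / f"{model_dir.name}.bin").touch()


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "CPU" / "my_model"
    add_precision(model_dir, "FP32")
    cpu_choice = resolve_precision(model_dir, "cpu")
    igpu_choice = resolve_precision(model_dir, "igpu")      # no FP16 yet
    add_precision(model_dir, "FP16")
    igpu_with_fp16 = resolve_precision(model_dir, "igpu")

print(cpu_choice, igpu_choice, igpu_with_fp16)  # FP32 FP32 FP16
```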

note

Nimble does support running models in reduced-precision modes other than FP16. If this is something you are interested in, please contact your Megh representative.

Adding a `gpu` model

Nimble leverages the Triton Inference Server to perform inference on NVIDIA GPUs. Triton supports multiple frameworks, including TensorFlow, PyTorch, ONNX and TensorRT. For frameworks that are not supported by Triton, such as MXNet, we recommend converting your model to ONNX.

gpu models must be placed in the models/GPU directory and must follow the Triton directory layout. An example layout for an ONNX model is shown below:

models/GPU/<name_of_model>/
├── 1
│   └── model.onnx
├── config.pbtxt
└── labels.txt

The config.pbtxt is the Triton configuration file; the Triton documentation has more information on creating these files. Like cpu and igpu models, gpu models also require a models/GPU/<name_of_model>/labels.txt, a newline-delimited file with the class names.
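
Before deploying, it can help to sanity-check a gpu model directory against the layout above. The sketch below is a hypothetical helper (`check_triton_layout` is not part of Nimble or Triton) that only looks at file and directory names; it does not validate the contents of config.pbtxt.

```python
import tempfile
from pathlib import Path


def check_triton_layout(model_dir):
    """Return a list of problems found in a Triton-style model directory."""
    model_dir = Path(model_dir)
    problems = []
    for required in ("config.pbtxt", "labels.txt"):
        if not (model_dir / required).is_file():
            problems.append(f"missing {required}")
    # Triton version directories have purely numeric names, e.g. "1".
    versions = sorted(
        (p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    if not versions:
        problems.append("missing numeric version directory")
    for version in versions:
        if not any(version.iterdir()):
            problems.append(f"version directory {version.name} is empty")
    return problems


# Demo on a throwaway directory; "my_model" is a made-up model name.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my_model"
    (model_dir / "1").mkdir(parents=True)
    (model_dir / "1" / "model.onnx").touch()
    (model_dir / "config.pbtxt").touch()
    before = check_triton_layout(model_dir)   # labels.txt not written yet
    (model_dir / "labels.txt").write_text("person\n", encoding="utf-8")
    after = check_triton_layout(model_dir)

print(before)  # ['missing labels.txt']
print(after)   # []
```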

tip

We ship a variety of GPU models with our release. Please take a look at their config.pbtxt files for examples of how to enable TensorRT, FP16 precision and dynamic batching.

Creating the pre- and post-processing functions.

Models generally have different pre- and post-processing functions: some are integrated into the model file itself, while others run as separate functions before and after data ingestion. To support functions that run before and after data ingestion, we provide a simple Python API, which is required to run the model. For a model that performs object detection, the base of the class looks like this:

import numpy as np
from nimble.models.Detector import Detector

class <MODEL_NAME>(Detector):
    models = ["<name_of_model>"]

    @staticmethod
    def preprocess(image):
      ...

    @staticmethod
    def postprocess(data, params):
      ...

First we import the Detector class from Nimble and have our class inherit from it. Next we create a models list, which names the model directories that this class supports. <name_of_model> must match the name of the directory that holds the model binaries. There are a few items to note:

This class is device-agnostic: if you use the same model (and <name_of_model>) for both CPU and GPU, both targets share the same pre- and post-processing.
The models field is a list, so the same functions can be shared across different models of the same hardware target. A simple example of this is the different versions of EfficientDet.
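
To make these two points concrete, here is a minimal sketch of one class serving several model directories. The `Detector` base class is a stand-in so that the snippet runs on its own, the EfficientDet directory names are made up, and `build_registry` is only our guess at how such a lookup could work; it is not Nimble's implementation.

```python
class Detector:
    """Stand-in for nimble.models.Detector so this sketch runs on its own."""
    models = []


class EfficientDet(Detector):
    # Hypothetical directory names; each must match a models/<target>/<name> directory.
    models = ["efficientdet-d0", "efficientdet-d1"]

    @staticmethod
    def preprocess(image):
        pass  # nothing beyond the standard operations Nimble already performs

    @staticmethod
    def postprocess(data, params):
        raise NotImplementedError("model-specific decoding goes here")


def build_registry(classes):
    """Map every supported model directory name to its processing class."""
    registry = {}
    for cls in classes:
        for name in cls.models:
            registry[name] = cls
    return registry


registry = build_registry([EfficientDet])

# The key carries no hardware target, so a CPU and a GPU copy of
# "efficientdet-d0" would resolve to the same class.
print(registry["efficientdet-d0"].__name__)  # EfficientDet
print(registry["efficientdet-d1"].__name__)  # EfficientDet
```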

Finally, we have the preprocess(image) and postprocess(data, params) functions. Nimble calls preprocess(image) right before it issues the inference request. Standard operations such as resize, transpose and datatype conversion are performed automatically by Nimble.

The image will be:

of type: np.float32
of shape = (C, H, W) or shape = (H, W, C) depending on your model format.

For example, the YoloV5 preprocess(image) function is simply:

    def preprocess(image):
        image /= 255.0
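
Note that the YoloV5 function above modifies the array in place and returns nothing, which suggests (our reading, not a documented guarantee) that Nimble keeps using the same array afterwards. The sketch below shows this behaviour on a stand-in float32 image. The in-place division only works because the array is already floating point; on an integer array NumPy would raise a casting error.

```python
import numpy as np


def preprocess(image):
    # In-place division: the caller's array is modified and nothing is returned.
    image /= 255.0


# Stand-in for the float32 image Nimble passes in (shape here is C, H, W).
image = np.full((3, 4, 4), 255.0, dtype=np.float32)
result = preprocess(image)

print(result)        # None
print(image.dtype)   # float32
print(image.max())   # 1.0
```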

The postprocess(data, params) function is called right after the results of the inference request have been received. Nimble packages the data into a dictionary:

data = {
  "<out_blob_0>" : np.array(B, ...),
  "<out_blob_1>" : np.array(B, ...),
  ...
}

Here <out_blob_N> is the name of the output blob and np.array(B, ...) is the data associated with that blob. Note that because Nimble is a streaming pipeline, the batch size seen by postprocess is always 1 (B == 1). Internally, Nimble uses dynamic batching and asynchronous inference requests to ensure full utilisation of the available hardware resources; B == 1 is kept at this interface to make it easier to integrate external reference post-processing functions.
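
Since B is always 1, a post-processing function can drop the batch dimension up front. A minimal sketch, assuming a single output blob with the hypothetical name "output" and a made-up shape; `unbatch` is our own helper, not part of Nimble:

```python
import numpy as np

# Stand-in for the dictionary Nimble passes to postprocess.
# "output" and the (1, 5, 6) shape are hypothetical.
data = {"output": np.zeros((1, 5, 6), dtype=np.float32)}


def unbatch(data):
    """Drop the leading batch dimension (B == 1) from every output blob."""
    unbatched = {}
    for name, blob in data.items():
        if blob.shape[0] != 1:
            raise ValueError(f"{name}: expected batch size 1, got {blob.shape[0]}")
        unbatched[name] = blob[0]
    return unbatched


per_image = unbatch(data)
print(per_image["output"].shape)  # (5, 6)
```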

Along with data, Nimble will also pass in params as a python dictionary:

params = {
  "score"      : float,   # score_threshold
  "iou"        : float,   # iou_threshold
  "w"          : int,     # Model Ingestion Width
  "h"          : int,     # Model Ingestion Height
  "original_w" : int,     # Original Image Width
  "original_h" : int,     # Original Image Height
}
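
If your post-processing needs to map boxes from the model's input resolution back to the original image, the w, h, original_w and original_h fields provide the scale factors. The sketch below assumes a plain stretch resize with no letterbox padding, which may not match how your model was trained; `scale_boxes` is a hypothetical helper and the parameter values are made up.

```python
import numpy as np


def scale_boxes(boxes, params):
    """Rescale [xmin, ymin, xmax, ymax] boxes from model input size to original size.

    Assumes a plain stretch resize (no letterbox padding).
    """
    scale_x = params["original_w"] / params["w"]
    scale_y = params["original_h"] / params["h"]
    scaled = boxes.astype(np.float32)  # astype copies, so the input is untouched
    scaled[:, [0, 2]] *= scale_x       # xmin, xmax
    scaled[:, [1, 3]] *= scale_y       # ymin, ymax
    return scaled


params = {
    "score": 0.5, "iou": 0.45,
    "w": 640, "h": 640,
    "original_w": 1280, "original_h": 720,
}
boxes = np.array([[320.0, 320.0, 640.0, 640.0]], dtype=np.float32)
scaled = scale_boxes(boxes, params)
print(scaled)
```

With these values the box (320, 320, 640, 640) in model coordinates maps to (640, 360, 1280, 720) in the original image.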

Finally, postprocess(data, params) must return a NumPy array with the following row structure:

[id, label_idx, score(conf), xmin, ymin, xmax, ymax]

Here the id is a user-assigned value, but it is rarely used.
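
Putting the pieces together, the sketch below shows what a complete postprocess could look like. It assumes a hypothetical model with a single output blob named "output" of shape (1, N, 6), whose rows are [xmin, ymin, xmax, ymax, score, label_idx]; real models usually need model-specific decoding first. It applies the score threshold, a simple per-class non-maximum suppression using the iou threshold, and leaves id at 0. Coordinates are passed through unchanged.

```python
import numpy as np


def box_iou(box, others):
    """IoU between one box and an array of boxes, all [xmin, ymin, xmax, ymax]."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter + 1e-9)


def postprocess(data, params):
    # Hypothetical layout: (1, N, 6) rows of [xmin, ymin, xmax, ymax, score, label_idx].
    dets = data["output"][0]
    dets = dets[dets[:, 4] >= params["score"]]   # score threshold
    dets = dets[np.argsort(-dets[:, 4])]         # highest score first

    keep = []
    while len(dets):
        best, dets = dets[0], dets[1:]
        keep.append(best)
        if len(dets):
            # Suppress lower-scoring boxes of the same class that overlap too much.
            same_class = dets[:, 5] == best[5]
            overlaps = box_iou(best[:4], dets[:, :4]) > params["iou"]
            dets = dets[~(same_class & overlaps)]

    rows = np.zeros((len(keep), 7), dtype=np.float32)
    for i, det in enumerate(keep):
        # [id, label_idx, score, xmin, ymin, xmax, ymax]; id is left at 0.
        rows[i] = [0, det[5], det[4], det[0], det[1], det[2], det[3]]
    return rows


data = {"output": np.array([[
    [10, 10, 50, 50, 0.9, 0],
    [12, 12, 52, 52, 0.8, 0],      # overlaps the first box, same class
    [100, 100, 150, 150, 0.7, 1],
    [0, 0, 5, 5, 0.1, 1],          # below the score threshold
]], dtype=np.float32)}
params = {"score": 0.5, "iou": 0.45, "w": 640, "h": 640,
          "original_w": 640, "original_h": 640}

result = postprocess(data, params)
print(result.shape)  # (2, 7)
```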

note

All of our pre- and post-processing python files are available and can be viewed here: <nimble_path>/models

More complex interactions are possible. For a concrete example, please refer to the TinyYoloV3 model and its implementation in <nimble_path>/models/TinyYoloV3.py.
