Custom Model
This document lists the steps for adding custom models to the Nimble framework. Following these instructions exposes your model to Nimble, allowing you to deploy it and subscribe to its metadata via the REST and WebSocket APIs. Models are added in two steps:
- Their binary files are added to the models directory.
- Their pre- and post-processing functions are provided in a Python file.
First we will look at adding the binary files to the models directory, the structure of which is displayed below:
models
├── CPU
├── GPU
...
├── YoloV5Face.py
└── YoloV5.py
In addition to the hardware target directories, the models directory holds the pre- and post-processing Python files, each associated with one or more models.
Within each of the hardware targets is a directory corresponding to a specific model.
Each of these directories stores the binary files needed to load the network.
Currently, cpu and igpu models share the same directory (models/CPU).
For example, the models/CPU directory looks something like this:
models/CPU/
├── arcface
├── coco-large
...
├── yolov5s
└── yolov5s6
Adding a cpu or igpu model
For the cpu and igpu hardware targets, the currently supported deep learning framework is OpenVINO.
Each framework has a slightly different layout, so please refer to the section for the framework you are using.
Common to every framework is the models/CPU/<name_of_model>/labels.txt file, a newline-delimited list of class names (needed only if the model requires it).
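As an illustration of the labels.txt format described above, the following is a minimal sketch of reading such a file. The helper name load_labels is hypothetical and is not part of the Nimble API; it only shows the assumed one-class-name-per-line layout.

```python
from pathlib import Path


def load_labels(model_dir):
    """Read class names from <model_dir>/labels.txt, one name per line.

    Hypothetical helper for illustration; not part of Nimble.
    """
    labels_path = Path(model_dir) / "labels.txt"
    if not labels_path.exists():
        # The file is only needed when the model requires class names.
        return []
    lines = labels_path.read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]
```

With this layout, the index of a name in the returned list would correspond to the class index produced by the model, assuming the file lists the classes in the model's own order.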
OpenVINO
OpenVINO is the recommended deep learning framework because it achieves high performance; however, due to restrictions in the OpenVINO Model Optimiser, it can be difficult to create the binary files.
To add an OpenVINO model to Nimble you will need the .xml and .bin files created by the OpenVINO Model Optimiser. We recommend that you use DL Workbench to attempt to convert your models.
Once you have the binary files, create the directory models/CPU/<name_of_model> and place the .xml and .bin files in a nested FP32 directory. Your directory structure should look like this:
models/CPU/<name_of_model>
├── FP32
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
└── labels.txt
If you are planning to use the igpu, you can also provide an FP16 model like this:
models/CPU/<name_of_model>
├── FP32
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
├── FP16
│   ├── <name_of_model>.xml
│   └── <name_of_model>.bin
└── labels.txt
If the FP16 model isn't present, the igpu will default to FP32.
Nimble also supports running models in reduced-precision modes other than FP16; if this is something that you are interested in, please contact your Megh representative.
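The FP16-to-FP32 fallback described above can be pictured with a short sketch. This is not Nimble's actual loader: the function select_precision_dir and its device argument are hypothetical and only model the rule that the igpu uses FP16 when that directory exists and FP32 otherwise.

```python
from pathlib import Path


def select_precision_dir(model_dir, device):
    """Pick the precision sub-directory for a device.

    Illustrative only; the name and signature are assumptions,
    not part of the Nimble API.
    """
    model_dir = Path(model_dir)
    fp16 = model_dir / "FP16"
    fp32 = model_dir / "FP32"
    # The igpu prefers FP16 when it has been provided.
    if device == "igpu" and fp16.is_dir():
        return fp16
    # Everything else, including an igpu without FP16 files, uses FP32.
    return fp32
```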
Adding a gpu model
Nimble leverages the Triton Inference Server to perform inference on NVIDIA GPUs. Triton supports multiple frameworks, including TensorFlow, PyTorch, ONNX and TensorRT. For frameworks that are not supported by Triton, such as MXNet, we recommend converting your model to ONNX.
gpu models need to be placed in the models/GPU directory and must adhere to the Triton directory structure layout.
An example layout for an ONNX model is presented below:
models/GPU/<name_of_model>/
├── 1
│ └── model.onnx
├── config.pbtxt
└── labels.txt
The config.pbtxt is the Triton configuration file; you can find more information on creating these files in the Triton model configuration documentation.
Similar to cpu and igpu models, gpu models also require a models/GPU/<name_of_model>/labels.txt file, a newline-delimited list of class names.
We have a variety of GPU models available with our release; please take a look at their config.pbtxt files for examples of how to enable TensorRT, FP16 precision and dynamic batching.
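Before deploying a gpu model it can help to confirm that the directory matches the layout shown earlier. The sketch below is a hypothetical checker, not a Nimble or Triton tool: it only verifies that config.pbtxt, labels.txt and a non-empty numeric version directory are present, and does not parse the configuration itself.

```python
from pathlib import Path


def check_triton_layout(model_dir):
    """Return a list of problems found in a models/GPU/<name_of_model> directory.

    Hypothetical helper for illustration; an empty list means the basic
    layout looks right.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return ["model directory does not exist"]
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    if not (model_dir / "labels.txt").is_file():
        problems.append("missing labels.txt")
    # Version directories are named with a number, such as "1".
    versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not versions:
        problems.append("missing numeric version directory")
    elif not any(any(v.iterdir()) for v in versions):
        problems.append("version directories are empty")
    return problems
```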
Creating the pre- and post-processing functions.
Models generally have different pre- and post-processing functions; some are integrated into the model file itself, while others are run as separate functions before or after data ingestion. To support functions run before and after data ingestion we provide a simple Python API, which is required to run the model. For a model that performs object detection the base of the class looks like this:
import numpy as np
from nimble.models.Detector import Detector


class <MODEL_NAME>(Detector):
    models = ["<name_of_model>"]

    @staticmethod
    def preprocess(image):
        ...

    @staticmethod
    def postprocess(data, params):
        ...
First we need to import the Detector class from Nimble and have our class inherit from it.
Next we create a models list, which names the model directories that this class supports.
Each <name_of_model> must match the name of the directory that holds the model binaries.
There are a few items to note:
- This class is device-agnostic. If you use the same model (and the same <name_of_model>) for both CPU and GPU, both targets share the same pre- and post-processing.
- The models field is a list, so the same functions can be shared across different models on the same hardware target. A simple example of this is the different versions of EfficientDet.
Finally, we have the preprocess(image) and postprocess(data, params) functions.
Nimble calls preprocess(image) right before it issues the inference request.
Standard operations such as resize, transpose and datatype conversion are performed automatically by Nimble.
The image will be:
- of type np.float32
- of shape (C, H, W) or (H, W, C), depending on your model format.
For example, the YoloV5 preprocess(image) function is simply:
def preprocess(image):
    image /= 255.0
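The YoloV5 example modifies the array in place and returns nothing, which works because the image is already np.float32. The standalone sketch below demonstrates this behaviour on a stand-in array; the array contents and shape are made up for illustration, and the way Nimble consumes the result is an assumption based on the example above.

```python
import numpy as np


def preprocess(image):
    # Scale pixel values from [0, 255] to [0, 1] in place.
    # In-place division is only valid because image is a float array;
    # NumPy would refuse to write the float result into an integer array.
    image /= 255.0


# Hypothetical stand-in for the array Nimble passes in: float32, (C, H, W).
image = np.full((3, 4, 4), 255.0, dtype=np.float32)
preprocess(image)
```

After the call, the caller's array holds the scaled values and keeps its float32 dtype, so no copy or return value is involved.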
The postprocess(data, params) function is called right after the results of the inference request have been received.
Nimble packages the data into a dictionary:
data = {
    "<out_blob_0>": np.array(B, ...),
    "<out_blob_1>": np.array(B, ...),
    ...
}
Here <out_blob_N> is the name of the output blob and np.array(B, ...) is the data associated with that blob.
It is important to note that since Nimble is a streaming pipeline, the batch size will always be 1 (B == 1).
If throughput mode is enabled, Nimble uses dynamic batching and asynchronous inference requests to ensure full utilisation of the available hardware resources.
Even then, B == 1 is kept to make integration of external reference post-processing functions easier.
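Since every blob arrives with a leading batch dimension of 1, a common first step in post-processing is to select a blob and drop that dimension. The following is a minimal sketch; the blob name "output", its shape and the helper first_output are all hypothetical.

```python
import numpy as np

# Hypothetical data dictionary with a single output blob.
# The name "output" and the shape (1, 5, 6) are assumptions for illustration.
data = {"output": np.zeros((1, 5, 6), dtype=np.float32)}


def first_output(data):
    """Return the first output blob with the batch dimension removed."""
    name = next(iter(data))
    blob = data[name]
    if blob.shape[0] != 1:
        raise ValueError(f"expected batch size 1, got {blob.shape[0]}")
    # Indexing with 0 removes the leading batch axis.
    return blob[0]


predictions = first_output(data)
```

Models with several output blobs would instead look each one up by its name.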
Along with data, Nimble also passes in params as a Python dictionary:
params = {
    "score": float,       # score_threshold
    "iou": float,         # iou_threshold
    "w": int,             # Model Ingestion Width
    "h": int,             # Model Ingestion Height
    "original_w": int,    # Original Image Width
    "original_h": int,    # Original Image Height
}
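One possible use of the size fields in params is mapping boxes from model ingestion coordinates back to the original image. The sketch below assumes a plain resize with no letterboxing or padding, which may not match your model, and the helper scale_boxes is hypothetical. Whether Nimble expects boxes in original or model coordinates is not stated here, so check one of the shipped models before relying on this.

```python
import numpy as np


def scale_boxes(boxes, params):
    """Map [xmin, ymin, xmax, ymax] rows from model size to original size.

    Assumes the image was resized directly to (w, h) without padding.
    Hypothetical helper; not part of the Nimble API.
    """
    boxes = np.asarray(boxes, dtype=np.float32).copy()
    scale_x = params["original_w"] / params["w"]
    scale_y = params["original_h"] / params["h"]
    boxes[:, [0, 2]] *= scale_x
    boxes[:, [1, 3]] *= scale_y
    # Keep the boxes inside the original image.
    boxes[:, [0, 2]] = np.clip(boxes[:, [0, 2]], 0, params["original_w"])
    boxes[:, [1, 3]] = np.clip(boxes[:, [1, 3]], 0, params["original_h"])
    return boxes
```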
Finally, postprocess(data, params) is expected to return a NumPy array in which each row has the following structure:
[id, label_idx, score(conf), xmin, ymin, xmax, ymax]
In this case the id is a user-assigned value, but it is rarely used.
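To make the expected row structure concrete, here is a minimal postprocess sketch. It assumes a hypothetical model with a single output blob of shape (1, N, 6) whose rows are [xmin, ymin, xmax, ymax, score, label_idx]; real models use different layouts. It only applies the score threshold, so non-maximum suppression with params["iou"] and any coordinate rescaling are left out.

```python
import numpy as np


def postprocess(data, params):
    # Assumption: one output blob, shape (1, N, 6), with rows
    # [xmin, ymin, xmax, ymax, score, label_idx]. Real models differ.
    preds = next(iter(data.values()))[0]
    # Keep only detections at or above the score threshold.
    preds = preds[preds[:, 4] >= params["score"]]
    out = np.zeros((len(preds), 7), dtype=np.float32)
    out[:, 0] = np.arange(len(preds))  # id: user-assigned, here a running index
    out[:, 1] = preds[:, 5]            # label_idx
    out[:, 2] = preds[:, 4]            # score (confidence)
    out[:, 3:7] = preds[:, 0:4]        # xmin, ymin, xmax, ymax
    return out
```

Inside a Detector subclass this function would carry the @staticmethod decorator, as in the class skeleton shown earlier.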
All of our pre- and post-processing Python files are available and can be viewed here: <nimble_path>/models
More complex interactions are possible. For a concrete example, please refer to the TinyYoloV3 model and its implementation in <nimble_path>/models/TinyYoloV3.py.