当前位置：首页 > news >正文

TensorRT 5.1.5开发简介

news 2025/6/7 2:18:35

环境搭建
- 本机环境

CUDA:9.0(cat /usr/local/cuda/version.txt)

cudnn:7.2 (cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2)

需要升级cudnn到7.5，就可以使用最新版本的TensorRT 5.1.5。

TensorRT-5.1.5.0.Ubuntu-16.04.5.x86_64-gnu.cuda-9.0.cudnn7.5.tar.gz

下载安装cudnn7.5版本

从公司百度网盘上下载cudnn7.5：

cudnn-9.0-linux-x64-v7.5.0.56.tar

删除旧版本：

sudo rm -rf /usr/local/cuda/include/cudnn.h

sudo rm -rf /usr/local/cuda/lib64/libcudnn*

安装新版本：

解压tar包，cd进入cuda文件夹，

sudo cp include/cudnn.h /usr/local/cuda/include/

sudo cp lib64/lib* /usr/local/cuda/lib64/

查看本机cudnn版本号：

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

输出：

#define CUDNN_MAJOR 7

#define CUDNN_MINOR 5

#define CUDNN_PATCHLEVEL 0

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

表示当前版本为cudnn7.5

下载TensorRT安装包

下载链接：https://developer.nvidia.com/nvidia-tensorrt-5x-download

安装方式：5.1GA(general availability，官方推荐使用)，Tar File Install Packages For Linux x86

安装包：TensorRT 5.1.5.0 GA for Ubuntu 16.04 and CUDA 9.0 tar package

安装TensorRT

cd /home/zhangjing/download

# 安装包解压

tar zxv TensorRT-5.1.5.0.Ubuntu-16.04.5.x86_64-gnu.cuda-9.0.cudnn7.5.tar.gz

# 更改环境变量

vim ~/.bashrc

export LD_LIBRARY_PATH=/home/zhangjing/download/TensorRT-5.1.5.0/lib:$LD_LIBRARY_PATH

export CUDA_INSTALL_DIR=/usr/local/cuda

export CUDNN_INSTALL_DIR=/usr/local/cuda

source ~/.bashrc

cd /home/zhangjing/download/TensorRT-5.1.5.0/python

sudo pip2 install tensorrt-5.1.5.0-cp27-none-linux_x86_64.whl

cd /home/zhangjing/download/TensorRT-5.1.5.0/uff

sudo pip2 install uff-0.6.3-py2.py3-none-any.whl

验证安装是否成功

# 编译所有示例

cd /home/zhangjing/download/TensorRT-4.0.1.6/samples

make -j8

# 注：也可单独编译某个示例

cd /home/zhangjing/download/TensorRT-4.0.1.6/samples/sampleMNIST

make -j8

# 运行示例，use caffe parser to import the MNIST model

cd /home/zhangjing/download/TensorRT-4.0.1.6/bin

./sample_mnist

开发流程
- 帮助文档

TensorRT library API:

https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/index.html

TensorRT Developer Guide:

https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-515/tensorrt-developer-guide/index.html

RPROI介绍
createRPNROIPlugin

createRPNROIPlugin：使用用户给定的参数，返回结合了Faster RCNN的RPN和ROI pooling的扩展层。此函数是RPROI的具体实现。

在API中的介绍：Returns a FasterRCNN fused RPN+ROI pooling plugin. Returns nullptr on invalid inputs.

https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/_nv_infer_plugin_8h.html#a79b440b214b692cbc8496027aa87c1e7

RPROI_TRT怎么使用

在TensorRT中，已实现了部分额外的扩展层：

RPROI_TRT

Normalize_TRT

PriorBox_TRT

GridAnchor_TRT

NMS_TRT

LReLU_TRT

Reorg_TRT

Region_TRT

Clip_TRT

要使用这些TensorRT注册的层，需要加载libnvinfer_plugin.so，注册以上的层。调用下面的代码可以实现此功能。

initLibNvInferPlugins(void* logger, const char* libNamespace)()

在sampleFasterRCNN中，程序开头调用了initLibNvInferPlugins，于是，可以直接在prototxt中定义层RPROI_TRT。直接在data\faster-rcnn\faster_rcnn_test_iplugin.prototxt中，使用RPROI替换RPN+ROI pooling。

layer {

name: "RPROIFused"

type: "RPROI"

bottom: 'rpn_cls_prob_reshape'

bottom: 'rpn_bbox_pred'

bottom: 'conv5_3'

bottom: 'im_info'

top: 'rois'

top: 'pool5'

region_proposal_param {

feature_stride: 16

prenms_top: 6000

nms_max_out: 300

anchor_ratio_count: 3

anchor_scale_count: 3

iou_threshold: 0.7

min_box_size: 16

anchor_ratio: 0.5

anchor_ratio: 1.0

anchor_ratio: 2.0

anchor_scale: 8.0

anchor_scale: 16.0

anchor_scale: 32.0

}

roi_pooling_param {

pooled_h: 7

pooled_w: 7

spatial_scale: 0.0625

}

在SSH中，proposal层可以用RPROI层直接定义。

RPROI层的输入参数

featureStride, preNmsTop, nmsMaxOut, iouThreshold, minBoxSize, spatialScale,

DimsHW(poolingH, poolingW), Weights{nvinfer1::DataType::kFLOAT, anchorsRatios, anchorsRatioCount},

Weights{nvinfer1::DataType::kFLOAT, anchorsScales, anchorsScaleCount}

RPROI的输出

top: 'rois'

top: 'pool5'

rois 是bbox的offsets to the center, height, and width的格式;

pool5是bbox对应的得分。

推断流程

以/tensorrt/samples/sampleFasterRNN为例进行说明。

其中，RPN和ROIPooling被优化为一层实现：RPNROIPlugin。这个层在TensorRT中的注册名为：RPROI_TRT。

输入输出

图片以.ppm的格式传入网络。好处是：每个像素的RGB值被存储为一个整数【0-255】。可以通过命令行工具（ImageMagick）将JPEG图片转化为ppm格式的图片。

readPPMFile(): 加载.ppm格式的图片

writePPMFileWithBBox(): 根据给定的bbox在ppm图片上标出一个1像素的红色边框。

定义网络

data/faster-rcnn/ faster_rcnn_test_iplugin.prototxt

此文件定义了类似于Faster RCNN的网络。但是，区别在于：合并了RPN 和 ROI pooling，在prototxt中，新层的名字叫RPROI。

layer {

name: "RPROIFused"

type: "RPROI"

bottom: 'rpn_cls_prob_reshape'

bottom: 'rpn_bbox_pred'

bottom: 'conv5_3'

bottom: 'im_info'

top: 'rois'

top: 'pool5'

region_proposal_param {

feature_stride: 16

prenms_top: 6000

nms_max_out: 300

anchor_ratio_count: 3

anchor_scale_count: 3

iou_threshold: 0.7

min_box_size: 16

anchor_ratio: 0.5

anchor_ratio: 1.0

anchor_ratio: 2.0

anchor_scale: 8.0

anchor_scale: 16.0

anchor_scale: 32.0

}

roi_pooling_param {

pooled_h: 7

pooled_w: 7

spatial_scale: 0.0625

}

创建builder和network

IBuilder* builder = createInferBuilder(gLogger);

nvinfer1::INetworkDefinition* network = builder->createNetwork();

创建parser

ICaffeParser* parser = createCaffeParser();

解析模型，生成network

const IBlobNameToTensor* blobNameToTensor = parser->parse(locateFile(deployFile).c_str(),

locateFile(modelFile).c_str(),

*network,

DataType::kFLOAT);

创建engine

builder->setMaxBatchSize(maxBatchSize); // maxBatchSize是每次推断处理的图片数，本例是5

builder->setMaxWorkspaceSize(1 << 20); //建议10 * (2^20)，也就是10MB/5张图片

ICudaEngine* engine = builder->buildCudaEngine(*network);

不用时，需要回收资源：

parser->destroy();

network->destroy();

builder->destroy();

序列化engine与反序列化

IHostMemory *serializedModel = engine->serialize();

// store model to disk

// <…>

serializedModel->destroy();

反序列化：

IRuntime* runtime = createInferRuntime(gLogger);

ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);

序列化是为了将模型保存在内存中，以待后续使用；反序列化是为了使用模型推断时从内存加载模型。从内存加载要比从prototxt和caffemodel中加载快多了。

推断

1.创建context，以保存推断过程中产生的中间变量。

IExecutionContext *context = engine->createExecutionContext();

一个engine可以有多个context同时运行，以此实现一个网络的多个重叠推断任务。例如：通过同一个GPU生成多个context和engine，生成多个CUDA streams，每个stream拥有一个独立的context和engine，如此，可以实现并行推断多个图片。

2.根据输入输出层的名字，获取相关的层的index

int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);

int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);

3.根据获取到的indices, 生成GPU缓存和一个stream

void* buffers[5];

CHECK(cudaMalloc(&buffers[inputIndex0], dataSize * sizeof(float))); // data

CHECK(cudaMalloc(&buffers[inputIndex1], imInfoSize * sizeof(float))); // im_info

CHECK(cudaMalloc(&buffers[outputIndex0], bboxPredSize * sizeof(float))); // bbox_pred

CHECK(cudaMalloc(&buffers[outputIndex1], clsProbSize * sizeof(float))); // cls_prob

CHECK(cudaMalloc(&buffers[outputIndex2], roisSize * sizeof(float))); // rois

cudaStream_t stream;

CHECK(cudaStreamCreate(&stream));

4.tensorRT是典型的异步方式，在流中进行推断

context->enqueue(batchSize, buffers, stream, nullptr);

在kernel调用前后，一般要调用memcpy()来移动数据。enqueue()的参数是一个可以设置的CUDA event，当buffers被填充满或者其内存被安全重置，该event可以被触发。

通过evnets与等待流的同步操作，可以决定kernel或memcpy()是否完成。

处理网络输出数据

首先，bbox是以offsets to the center, height, and width的格式给出的，需要通过除以输入的imInfo.scale被还原放大到原图空间.

其次，给bbox数据做逆变换，并剪切，使bbox的范围不超过原图。

最后，用nms（non-maximum suppression）算法去掉重叠的bbox。

由于以上后处理代码既非计算密集，也非存储密集，所以在cpu上定义。

最终输出结果是分类、置信得分、4个坐标。并且使用函数writePPMFileWithBBox在输出的图片上用红色1像素的框标记出来。

运行sample_fasterRCNN

下载模型faster_rcnn_models.tgz

下载地址

https://www.soupan8.com/file/23544645

在data/faster-rcnn下执行命令：

tar zxvf faster_rcnn_models.tgz -C ./ --strip-components=1 --exclude=ZF_*

到bin目录下执行

./ sample_fasterRCNN

执行过程如下，输出图片保存在bin目录下。

查看person-0.974725.ppm

添加自定义层

IPluginV2Ext是自定义层需要继承的基类。

IPluginCreator是在网络构建阶段生成自定义层的类，并可以在推断时反序列化自定义层。

IPluginCreator::createPlugin()：创建一个plugin object;
addPluginV2(): 创建层并添加层到网络，并绑定层和plugin；
addPluginV2()可以返回一个IPluginV2Layer对象的指针，通过指针可以操作层的数据。

给网络添加一个自定义层

（name: pluginName, version: pluginVersion）示例代码如下：

//Use the extern function getPluginRegistry to access the global TensorRT Plugin Registry

auto creator = getPluginRegistry()->getPluginCreator(pluginName, pluginVersion);

const PluginFieldCollection* pluginFC = creator->getFieldNames();

//populate the field parameters (say layerFields) for the plugin layer

PluginFieldCollection *pluginData = parseAndFillFields(pluginFC, layerFields);

//create the plugin object using the layerName and the plugin meta data

IPluginV2 *pluginObj = creator->createPlugin(layerName, pluginData);

//add the plugin to the TensorRT network using the network API

auto layer = network.addPluginV2(&inputs[0], int(inputs.size()), pluginObj);

… (build rest of the network and serialize engine)

pluginObj->destroy() // Destroy the plugin object

… (destroy network, engine, builder)

… (free allocated pluginData)

在序列化时，TensorRT engine将存储所有IPluginV2 类型的plugin type, plugin version, 和 namespace(if exists)。在反序列化时，TensorRT通过搜索这些信息从plugin registry来找到对应的plugin creator。这使得TensorRT engine可以调用IPluginCreator::deserializePlugin()。在反序列化过程中，创建的plugin将会被engine调用IPluginV2::destroy()来销毁。

在旧版本的TensorRT反序列化过程中，我们需要继承nvinfer1::IpluginFactory才可以调用createPlugin。但是在新版本的TensorRT中，对于被注册并且用addPluginV2来添加的Plugin来说，不需要再继承nvinfer1::IpluginFactory这个类了。