In TensorRT this is called quantization: training normally uses FP32 (single precision) by default (FP64 is double precision). TensorRT is typically used to quantize the model to FP16 (half precision), and it can also be configured to quantize to INT8.
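As a minimal sketch (not from the course code, assuming the TensorRT 7.x C++ API used later in these notes), the precision is usually chosen on the builder config; `makeConfig` is just a helper name of mine, and the resulting config is then passed to `buildEngineWithConfig` as in the yolov5 snippet further down:

#include <NvInfer.h>

nvinfer1::IBuilderConfig* makeConfig(nvinfer1::IBuilder* builder, bool useFp16, bool useInt8) {
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 30);                   // 1 GB of build-time scratch space
    if (useFp16 && builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);      // half precision
    if (useInt8 && builder->platformHasFastInt8())
        config->setFlag(nvinfer1::BuilderFlag::kINT8);      // also needs config->setInt8Calibrator(...), see the INT8 section below
    return config;
}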
A Zhihu reference: here
Some notes on TensorRT Plugin (I don't really understand this part yet): here
Versions: GA (general availability) is the official release; EA (early access) is the preview release
netron: for viewing network models; you can search that site for basic netron usage. There is also openlab material on deployment, exporting PyTorch models to ONNX, editing ONNX models, etc.
An ONNX viewer with editing support, link.
NVIDIA's official Bilibili tutorial, with accompanying code.
The installation part below was written a long time ago and isn't very important; just skim it. I haven't bothered updating it.
-
Download: get the .tar.gz from the official site (7.x versions), for example:
TensorRT-7.2.3.4.CentOS-7.9.x86_64-gnu.cuda-10.2.cudnn8.1.tar.gz
(there are also 8.x versions) -
Install: just extract the package somewhere; you get a folder named TensorRT-7.2.3.4 -
Add environment variables: use the path of that folder. Suppose it is /usr/local/TensorRT-7.2.3.4/, then:
vim ~/.bashrc                     # either append to this config file,
vim /etc/profile.d/tensorRT.sh    # or create a new file (name it yourself) with the content below
-
# dynamic library search path
export LD_LIBRARY_PATH=/usr/local/TensorRT-7.2.3.4/lib:$LD_LIBRARY_PATH
# static library search path
export LIBRARY_PATH=/usr/local/TensorRT-7.2.3.4/lib:$LIBRARY_PATH
# C++ header search path
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/TensorRT-7.2.3.4/include
Afterwards, remember to `source ~/.bashrc`, otherwise you will get "file not found" errors; if that still doesn't work, disconnect the ssh session and reconnect.
PS: try to match the CUDA and cuDNN versions to the ones in the TensorRT package name.
For example, when building the torch2trt project: after installing the Python TensorRT package and following its README, running `python setup.py develop` or `pip install -v -e .` may fail precisely because the TensorRT environment variables were not set. First you get "fatal error: NvInfer.h: No such file or directory"; after exporting the header path and continuing the README, you get "cannot find -lnvinfer"; exporting the dynamic library path alone was still not enough, and it only worked after also exporting the static library path. See the GCC compiler notes for more.
On Windows, just extract the archive and use the headers and library files from wherever you put it.
Reference: here
-
To use the Python interface of TensorRT you also need to install pycuda:
pip install pycuda
Troubleshooting:
- Prerequisite: gcc 4.8 is too old (you will see gcc errors); after installing gcc 8 on CentOS, the pip install went through. Then, when importing pycuda:
- you may hit ==ImportError: libnvinfer.so.7: cannot open shared object file: No such file or directory==, which means the TensorRT dynamic library path still has to be added to the environment variables.
- pycuda is version sensitive and works best with CUDA 10.2; even with CUDA 11.1 installed on the server it still complained about a missing ==libcudart.so.10.2==. A workaround is to create a CUDA 10.2 container and copy its libcudart.so.10.2 into /usr/local/cuda-11.1/targets/x86_64-linux/lib, after which it works.
Below is the installation. Suppose the package lives at /opt/TensorRT-7.2.3.4/; first cd into it.
-
For `import tensorrt` in Python:
# the "tensorrt" package on PyPI is NOT the right one
cd ./python
pip install tensorrt-7.2.3.4-cp37-none-linux_x86_64.whl   # wheels for other Python versions are in the same folder
# afterwards just use: import tensorrt as trt
-
Install UFF, for converting TensorFlow models:
cd ./uff
pip install uff-0.6.9-py2.py3-none-any.whl
-
Install graphsurgeon, for custom network structures:
cd ./graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
Based on sky_hole's Bilibili videos.
There is also a matching GitHub project; its code covers the various layers much more completely and is well worth referencing.
The network used is resnet18.pth, which is also exported to resnet18.onnx (so netron can inspect it); the application is written with Qt (a pure C++ console app, no UI).
Note: .pth files downloaded from the internet are usually saved with torch.save(net.state_dict(), "123.pth"), i.e. only the key-value weights, without the network structure. A model obtained with model = torch.load(path, map_location=torch.device("cpu")) therefore cannot be exported to ONNX directly. You need the network definition itself (say its instance is called model_net) and then
model_net.load_state_dict(torch.load(path, map_location=torch.device("cpu"))); only that model_net can be exported to ONNX as shown below.
-
Get the resnet18 network (when exporting the ONNX model from PyTorch, be sure to pass the argument training=2):

import torch
import torchvision

if __name__ == '__main__':
    model = torchvision.models.resnet18(pretrained=False)
    print(model)   # if the ONNX graph looks odd, print this structure and compare; they differ slightly
    torch.save(model.state_dict(), "./resnet18.pth")
    model = model.cuda()
    dummy_input = torch.ones(1, 3, 256, 256, dtype=torch.float32).cuda()
    # the ONNX graph looks much nicer in netron than the .pth
    # training=2 is required, otherwise batchnorm gets fused away and every layer name becomes a number
    torch.onnx.export(model, dummy_input, "./resnet18.onnx", verbose=True, training=2)
Then parse this resnet18.pth and dump each weight into a folder:

import os
import struct
import torch
import torchvision

torch.cuda.set_device(0)

def getWeights(model_path):
    state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
    keys = [value for key, value in enumerate(state_dict)]
    weights = dict()
    for key in keys:
        weights[key] = state_dict[key]
    return weights, keys

def extract(weights, keys, weights_path):
    if not os.path.exists(weights_path):
        os.mkdir(weights_path)
    for key in keys:
        print(key)
        value = weights[key]
        Shape = value.shape
        allsize = 1
        for idx in range(len(Shape)):
            allsize *= Shape[idx]
        Value = value.reshape(allsize)
        with open(weights_path + key + ".wgt", "wb") as fp:
            a = struct.pack("i", allsize)
            fp.write(a)
            for i in range(allsize):
                a = struct.pack("f", Value[i])
                fp.write(a)

if __name__ == '__main__':
    weights, keys = getWeights("./resnet18.pth")
    extract(weights, keys, "./trt_weights/")  # every layer's weights go into this folder for later use
-
Writing the Qt .pro file (the win32 scope is not required; keep it on Windows, drop it on Linux):

TEMPLATE = app
CONFIG += console c++11
CONFIG -= app_bundle
CONFIG -= qt

win32 {
INCLUDEPATH += \
    'E:\lib\TensorRT-7.2.3.4.Windows10.x86_64.cuda-10.2.cudnn8.1\include' \
    'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include'
}

win32 {
LIBS += \
    -L'E:\lib\TensorRT-7.2.3.4.Windows10.x86_64.cuda-10.2.cudnn8.1\lib' nvinfer.lib \
    -L'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64' cudart.lib
}

SOURCES += \
    main.cpp \
    tensorrt.cpp

HEADERS += \
    tensorrt.h
-
Things to note about TensorRT:
-
1. A .engine file produced by a TensorRT project is not portable across different GPUs, because it is tied to the internal architecture of the hardware; when several different GPUs are present you must pick the device index explicitly.
-
2. A TensorRT program has exactly one Logger, and all TensorRT log output (error/warning/info, etc.) goes through it. The recommended approach is to define your own logger by inheritance; in tensorrt.h:

#include <NvInfer.h>
#include <iostream>   // needed for std::cerr

class Logger : public nvinfer1::ILogger {
public:
    void log(nvinfer1::ILogger::Severity severity, const char* msg) override {
        if (severity == Severity::kINFO) return;
        switch (severity) {
            case Severity::kINTERNAL_ERROR: std::cerr << "kINTERNAL_ERROR: "; break;
            case Severity::kERROR:          std::cerr << "ERROR: ";           break;
            case Severity::kWARNING:        std::cerr << "kWARNING: ";        break;
            case Severity::kINFO:           std::cerr << "kINFO: ";           break;
            default:                        std::cerr << "UNKNOWN: ";         break;
        }
        std::cerr << msg << std::endl;
    }
};
-
The comments in the code are already very clear; the three files of section 3.2, together with the Logger class above, do run successfully.
They define most of the common layers; implementing resnet18 does not use all of them, which is fine.
#include <NvInfer.h>
#include <string>
#include <vector>
#include <map>
// helper struct to make writing the shuffle layer easier
struct shuffle {
nvinfer1::Dims reshape;
nvinfer1::Permutation permute;
};
class tensorRT {
public:
tensorRT();
void createENG(std::string engPath); // assembles all the layers of the network
// 0. load weights
std::vector<float> loadWeoghts(const std::string &weightPath);
// 1. convolution (every layer returns a tensor)
nvinfer1::ITensor* trt_conv(std::string inputLayerName, std::string weightsName, std::string biasPath, int output_c, int kernel, int stride, int padding);
// 2. batchnorm (m_network has no native batchnorm; it is built from its scale layer)
nvinfer1::ITensor* trt_batchnormal(std::string inputLayerName, std::string weightsName);
// 3. activation (relu, leaky_relu, sigmoid, ... many kinds)
nvinfer1::ITensor* trt_activation(std::string inputLayerName, std::string activate_type);
// 4. pooling (this one has no weight file)
nvinfer1::ITensor* trt_pool(std::string inputLayerName, std::string pool_type, int kernel, int stride, int padding);
// 5. element-wise add / subtract / divide between tensors
nvinfer1::ITensor* trt_calculate(std::string inputLayerName1, std::string inputLayerName2, std::string cal_type);
// 6. fc: fully connected layer
nvinfer1::ITensor* trt_fc(std::string inputLayerName, std::string weightsName, std::string biasName, int out_features);
/** The layers below were added in the last video and are not needed to run the demo above **/
// 7. matrix multiplication of two tensors
nvinfer1::ITensor* trt_matmul(std::string inputLayerName1, std::string inputLayerName2);
// 8. softmax: conceptually there are two outputs (a confidence and a class), but only one is returned at a time, chosen by dim (dim is 0 or 1)
nvinfer1::ITensor* trt_softmax(std::string inputLayerName, int dim);
// 9. concat: look at addConcatenation in nvinfer1::INetworkDefinition; it wants an array ITensor* const* inputs.
//    The usual pattern is to collect the tensors in a vector, then new an array and copy the elements over.
nvinfer1::ITensor* trt_concate(std::vector<std::string> inputLayerNames, int axis);
// 10. slice: this one makes quite a few assumptions and is fairly hard-coded; treat it only as a reference and check the underlying virtual function for the parameters
nvinfer1::ITensor* trt_slice(std::string inputLayerName, std::vector<int>start, std::vector<int>outputSize, std::vector<int>step);
// 11. shuffle: TensorRT's shuffle layer can do only reshape (view), only permute (transpose), or both; when doing both you must decide which comes first
//     Some values that should really be parameters are hard-coded in the implementation; adjust them as needed
shuffle m_shuffle; // the custom struct defined above
nvinfer1::ITensor* trt_shuffle(std::string inputLayerName, std::vector<int> reshapeSize, std::vector<int> permuteSize);
// 12. add a constant layer to the network, used to multiply a tensor by a constant (the alpha parameter below)
nvinfer1::ITensor* trt_constant(std::vector<int> dimensions, float alpha);
std::string rootPath = "E:/project/Pycharm_project/trt_study/trt_weights/";
Logger m_logger;
// the network definition itself; everything hangs off this
nvinfer1::INetworkDefinition *m_network;
/*
Each layer function above takes std::string inputLayerName as its first argument; strictly speaking each layer's input
should be a tensor, so this map associates names with tensors and is used to look them up.
*/
std::map<std::string, nvinfer1::ITensor*> Layers;
private:
void print_tensor_size(std::string layerName, nvinfer1::ITensor *input_tensor);
};
#include <iostream>
#include <fstream>
#include <cmath>     // for sqrt used in trt_batchnormal
#include "tensorrt.h"
tensorRT::tensorRT() { }
void tensorRT::print_tensor_size(std::string layerName, nvinfer1::ITensor *input_tensor) {
std::cout << layerName.c_str() << ": ";
// print the dimensions; these are pretty much the APIs to remember. Since our batch n is 1, the batch dimension does not show up here.
for (int i = 0; i < input_tensor->getDimensions().nbDims; i++) {
std::cout << input_tensor->getDimensions().d[i] << " ";
}
std::cout << std::endl;
}
std::vector<float> tensorRT::loadWeoghts(const std::string &weightPath) {
int size = 0;
std::ifstream file(weightPath, std::ios::in | std::ios::binary);
if (!file.is_open()) {
std::cout << "\nError: " << weightPath.c_str() << " " << "cannot open!\n" << std::endl;
return {};  // the file failed to open, so just return an empty vector here
}
file.read((char*)&size, 4 );
char* floatWeights = new char[size*4];
float *fp = (float*)floatWeights;
file.read(floatWeights, size*4);
std::vector<float> weights(fp, fp+size);
delete[] floatWeights;
file.close();
return weights;
}
void tensorRT::createENG(std::string engPath) {
int input_c = 3;
int input_h = 256;
int input_w = 256;
// building the engine needs this IBuilder; inference later needs an IRuntime instead
nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(this->m_logger);
this->m_network = builder->createNetwork();
// input tensor
// give the input a name and register it with the network
nvinfer1::ITensor *input = this->m_network->addInput("data", nvinfer1::DataType::kFLOAT,
nvinfer1::DimsCHW(static_cast<int>(input_c),
static_cast<int>(input_h),
static_cast<int>(input_w)));
// the network is written starting from this input
this->Layers["input"] = input;
/*
If the ONNX graph looks odd, print the model in Python and compare:
model = torchvision.models.resnet50(pretrained=False)
print(model) # the structure is clear; if a conv shows no padding, the padding is 0
The first line is: (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
Since there is no bias, the bias path (third argument) is empty; the output channels are 64.
For a convolution, the weight path is simply the full name of the weight file.
For a batchnorm layer, only the prefix of the name is passed; its function appends the rest,
because batchnorm has several files (meanValue, varValue, ...).
*/
this->Layers["conv1"] = this->trt_conv("input", "conv1.weight.wgt", "", 64, 7, 2, 3);
this->Layers["batchNormal1"] = this->trt_batchnormal("conv1", "bn1"); // the input of this layer is the previous "conv1"
this->Layers["relu1"] = this->trt_activation("batchNormal1", "relu");
this->Layers["maxPool1"] = this->trt_pool("relu1", "max", 3, 2, 1);
// the residual blocks start here
// layer1
this->Layers["layer1.0.conv1"] = this->trt_conv("maxPool1", "layer1.0.conv1.weight.wgt", "", 64, 3, 1, 1);
this->Layers["layer1.0.bn1"] = this->trt_batchnormal("layer1.0.conv1", "layer1.0.bn1"); // batchnorm has several weight files, so only the prefix is passed
this->Layers["layer1.0.relu1"] = this->trt_activation("layer1.0.bn1", "relu");
this->Layers["layer1.0.conv2"] = this->trt_conv("layer1.0.relu1", "layer1.0.conv2.weight.wgt", "", 64, 3, 1, 1);
this->Layers["layer1.0.bn2"] = this->trt_batchnormal("layer1.0.conv2", "layer1.0.bn2");
/*
This is how it would look for resnet50:
this->Layers["layer1.0.conv3"] = this->trt_conv("layer1.0.relu2", "layer1.0.conv3.weight.wgt", "", 256, 1, 1, 0);
this->Layers["layer1.0.bn3"] = this->trt_batchnormal("layer1.0.conv3", "layer1.0.bn3");
// from here it is layer1's downsample; in the ONNX graph its input is the max-pool output at the very top
this->Layers["layer1.0.downsample.0"] = this->trt_conv("maxPool1", "layer1.0.downsample.0.weight.wgt", "", 256, 1, 1, 0);
this->Layers["layer1.0.downsample.1"] = this->trt_batchnormal("layer1.0.downsample.0", "layer1.0.downsample.1");
// then the element-wise add of the two tensors ("layer1.add" is a name I made up)
this->Layers["layer1.add"] = this->trt_calculate("layer1.0.bn3", "layer1.0.downsample.1", "add");
this->Layers["layer1.relu1"] = this->trt_activation("layer1.add", "relu");
*/
// element-wise add of the two tensors ("layer1.add" is a name I made up)
this->Layers["layer1.add"] = this->trt_calculate("maxPool1", "layer1.0.bn2", "add");
this->Layers["layer1.relu1"] = this->trt_activation("layer1.add", "relu");
// the part above is layer1's (0) block in the .pth; the ONNX view still differs slightly from the printed .pth structure
// layer1.1
this->Layers["layer1.1.conv1"] = this->trt_conv("layer1.relu1", "layer1.1.conv1.weight.wgt", "", 64, 3, 1, 1);
this->Layers["layer1.1.bn1"] = this->trt_batchnormal("layer1.1.conv1", "layer1.1.bn1");
this->Layers["layer1.1.relu1"] = this->trt_activation("layer1.1.bn1", "relu");
this->Layers["layer1.1.conv2"] = this->trt_conv("layer1.1.relu1", "layer1.1.conv2.weight.wgt", "", 64, 3, 1, 1);
this->Layers["layer1.1.bn2"] = this->trt_batchnormal("layer1.1.conv2", "layer1.1.bn2");
// add
this->Layers["layer1.1.add"] = this->trt_calculate("layer1.relu1", "layer1.1.bn2", "add");
this->Layers["layer1.1.relu2"] = this->trt_activation("layer1.1.add", "relu");
// layer2
this->Layers["layer2.0.conv1"] = this->trt_conv("layer1.1.relu2", "layer2.0.conv1.weight.wgt", "", 128, 3, 2, 1);
this->Layers["layer2.0.bn1"] = this->trt_batchnormal("layer2.0.conv1", "layer2.0.bn1");
this->Layers["layer2.0.relu1"] = this->trt_activation("layer2.0.bn1", "relu");
this->Layers["layer2.0.conv2"] = this->trt_conv("layer2.0.relu1", "layer2.0.conv2.weight.wgt", "", 128, 3, 1, 1);
this->Layers["layer2.0.bn2"] = this->trt_batchnormal("layer2.0.conv2", "layer2.0.bn2");
// downsample
this->Layers["layer2.0.downsample.0"] = this->trt_conv("layer1.1.relu2", "layer2.0.downsample.0.weight.wgt", "", 128, 1, 2, 0);
this->Layers["layer2.0.downsample.1"] = this->trt_batchnormal("layer2.0.downsample.0", "layer2.0.downsample.1");
// add
this->Layers["layer2.add"] = this->trt_calculate("layer2.0.bn2", "layer2.0.downsample.1", "add");
this->Layers["layer2.relu1"] = this->trt_activation("layer2.add", "relu");
// layer2.1
this->Layers["layer2.1.conv1"] = this->trt_conv("layer2.relu1", "layer2.1.conv1.weight.wgt", "", 128, 3, 1, 1);
this->Layers["layer2.1.bn1"] = this->trt_batchnormal("layer2.1.conv1", "layer2.1.bn1");
this->Layers["layer2.1.relu1"] = this->trt_activation("layer2.1.bn1", "relu");
this->Layers["layer2.1.conv2"] = this->trt_conv("layer2.1.relu1", "layer2.1.conv2.weight.wgt", "", 128, 3, 1, 1);
this->Layers["layer2.1.bn2"] = this->trt_batchnormal("layer2.1.conv2", "layer2.1.bn2");
// add
this->Layers["layer2.1.add"] = this->trt_calculate("layer2.relu1", "layer2.1.bn2", "add");
this->Layers["layer2.1.relu1"] = this->trt_activation("layer2.1.add", "relu");
// layer3
this->Layers["layer3.0.conv1"] = this->trt_conv("layer2.1.relu1", "layer3.0.conv1.weight.wgt", "", 256, 3, 2, 1);
this->Layers["layer3.0.bn1"] = this->trt_batchnormal("layer3.0.conv1", "layer3.0.bn1");
this->Layers["layer3.0.relu1"] = this->trt_activation("layer3.0.bn1", "relu");
this->Layers["layer3.0.conv2"] = this->trt_conv("layer3.0.relu1", "layer3.0.conv2.weight.wgt", "", 256, 3, 1, 1);
this->Layers["layer3.0.bn2"] = this->trt_batchnormal("layer3.0.conv2", "layer3.0.bn2");
// downsample
this->Layers["layer3.0.downsample.0"] = this->trt_conv("layer2.1.relu1", "layer3.0.downsample.0.weight.wgt", "", 256, 1, 2, 0);
this->Layers["layer3.0.downsample.1"] = this->trt_batchnormal("layer3.0.downsample.0", "layer3.0.downsample.1");
// add
this->Layers["layer3.0.add"] = this->trt_calculate("layer3.0.bn2", "layer3.0.downsample.1", "add");
this->Layers["layer3.0.relu1"] = this->trt_activation("layer3.0.add", "relu");
// layer3.1
this->Layers["layer3.1.conv1"] = this->trt_conv("layer3.0.relu1", "layer3.1.conv1.weight.wgt", "", 256, 3, 1, 1);
this->Layers["layer3.1.bn1"] = this->trt_batchnormal("layer3.1.conv1", "layer3.1.bn1");
this->Layers["layer3.1.relu1"] = this->trt_activation("layer3.1.bn1", "relu");
this->Layers["layer3.1.conv2"] = this->trt_conv("layer3.1.relu1", "layer3.1.conv2.weight.wgt", "", 256, 3, 1, 1);
this->Layers["layer3.1.bn2"] = this->trt_batchnormal("layer3.1.conv2", "layer3.1.bn2");
// add
this->Layers["layer3.1.add"] = this->trt_calculate("layer3.0.relu1", "layer3.1.bn2", "add");
this->Layers["layer3.1.relu1"] = this->trt_activation("layer3.1.add", "relu");
// layer4
this->Layers["layer4.0.conv1"] = this->trt_conv("layer3.1.relu1", "layer4.0.conv1.weight.wgt", "", 512, 3, 2, 1);
this->Layers["layer4.0.bn1"] = this->trt_batchnormal("layer4.0.conv1", "layer4.0.bn1");
this->Layers["layer4.0.relu1"] = this->trt_activation("layer4.0.bn1", "relu");
this->Layers["layer4.0.conv2"] = this->trt_conv("layer4.0.relu1", "layer4.0.conv2.weight.wgt", "", 512, 3, 1, 1);
this->Layers["layer4.0.bn2"] = this->trt_batchnormal("layer4.0.conv2", "layer4.0.bn2");
// downsample
this->Layers["layer4.0.downsample.0"] = this->trt_conv("layer3.1.relu1", "layer4.0.downsample.0.weight.wgt", "", 512, 1, 2, 0);
this->Layers["layer4.0.downsample.1"] = this->trt_batchnormal("layer4.0.downsample.0", "layer4.0.downsample.1");
// add
this->Layers["layer4.0.add"] = this->trt_calculate("layer4.0.bn2", "layer4.0.downsample.1", "add");
this->Layers["layer4.0.relu1"] = this->trt_activation("layer4.0.add", "relu");
// layer4.1
this->Layers["layer4.1.conv1"] = this->trt_conv("layer4.0.relu1", "layer4.1.conv1.weight.wgt", "", 512, 3, 1, 1);
this->Layers["layer4.1.bn1"] = this->trt_batchnormal("layer4.1.conv1", "layer4.1.bn1");
this->Layers["layer4.1.relu1"] = this->trt_activation("layer4.1.bn1", "relu");
this->Layers["layer4.1.conv2"] = this->trt_conv("layer4.1.relu1", "layer4.1.conv2.weight.wgt", "", 512, 3, 1, 1);
this->Layers["layer4.1.bn2"] = this->trt_batchnormal("layer4.1.conv2", "layer4.1.bn2");
// add
this->Layers["layer4.1.add"] = this->trt_calculate("layer4.0.relu1", "layer4.1.bn2", "add");
this->Layers["layer4.1.relu1"] = this->trt_activation("layer4.1.add", "relu"); // 这层的形状打印出来看是:(512,8,8)
// avgpool:在python这层显示 (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
// 意思是最终输出的size是(1, 1),那这层的卷积核就是用(8, 8),步长就无所谓了
this->Layers["globalAvgPool"] = this->trt_pool("layer4.1.relu1", "average", 8, 1, 0);
// fc:全连接层 (最后的out_features=1000是网络定的)
this->Layers["fc"] = this->trt_fc("globalAvgPool", "fc.weight.wgt", "fc.bias.wgt", 1000);
// 让最后一层作为输出层
this->Layers["fc"]->setName("output");
this->m_network->markOutput(*this->Layers["fc"]); // 就这两行
builder->setMaxBatchSize(20); // 设置一些属性
builder->setMaxWorkspaceSize(1<<30); // 1G
std::cout << "engine init ..." << std::endl;
nvinfer1::ICudaEngine *engine = builder->buildCudaEngine(*this->m_network);
/*
The yolov5 TensorRT code passes a config in instead:
nvinfer1::ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
where config is an nvinfer1::IBuilderConfig*
// Engine config
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(16 * (1 << 20)); // 16MB; the properties are set on the config
#if defined(USE_FP16)
config->setFlag(nvinfer1::BuilderFlag::kFP16);
#elif defined(USE_INT8)
std::cout << "Your platform support int8: " << (builder->platformHasFastInt8() ? "true" : "false") << std::endl;
assert(builder->platformHasFastInt8());
config->setFlag(nvinfer1::BuilderFlag::kINT8); // the flag is set on the config
Int8EntropyCalibrator2 *calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, "./coco_calib/", "int8calib.table", kInputTensorName);
config->setInt8Calibrator(calibrator);
#endif
*/
nvinfer1::IHostMemory *modelStream = engine->serialize(); // Serialize the engine
// write it out as a .engine file
// ofstream already implies output, so std::ios::out is not strictly needed; with std::fstream it would be
std::ofstream engFile;
engFile.open(engPath, std::ios::out | std::ios::binary);
engFile.write(static_cast<const char*>(modelStream->data()), modelStream->size());
this->m_network->destroy();
engine->destroy();
builder->destroy();
modelStream->destroy();
}
nvinfer1::ITensor* tensorRT::trt_conv(std::string inputLayerName, std::string weightsName,
std::string biasPath, int output_c, int kernel, int stride, int padding) {
std::vector<float> weights;
std::vector<float> bias;
weights = this->loadWeoghts(this->rootPath + weightsName);
if (biasPath != "") { // bias可能没有
bias = loadWeoghts(biasPath);
}
int size = weights.size();
nvinfer1::Weights conWeights {nvinfer1::DataType::kFLOAT, nullptr, size}; // 这里只能用花括号,不能用()
nvinfer1::Weights conBias {nvinfer1::DataType::kFLOAT, nullptr, output_c};
float *val_wt = new float[size];
for (int i = 0; i < size; i++) {
val_wt[i] = weights[i];
}
conWeights.values = val_wt;
float *val_bias = new float[output_c];
for (int i = 0; i < output_c; i++) { // the bias has one value per output channel, hence i < output_c
val_bias[i] = 0.0;
if (bias.size() != 0) {
val_bias[i] = bias[i];
}
}
conBias.values = val_bias;
// build the TensorRT convolution layer; it exists natively, so addConvolution is used (batchnorm below has no native layer)
nvinfer1::IConvolutionLayer *conv = this->m_network->addConvolution(*this->Layers[inputLayerName], output_c,
nvinfer1::DimsHW(kernel, kernel), conWeights, conBias);
// IConvolutionLayer has setters for stride and padding
conv->setStride(nvinfer1::DimsHW(stride, stride));
conv->setPadding(nvinfer1::DimsHW(padding, padding));
this->print_tensor_size("conv", conv->getOutput(0));
return conv->getOutput(0); // getOutput(0) returns the layer's first (and only) output tensor
}
nvinfer1::ITensor* tensorRT::trt_batchnormal(std::string inputLayerName, std::string weightsName) {
/*
batchnorm has several weight files: weight, bias, mean, var.
TensorRT has no batchnorm layer, so it is built from the built-in Scale layer, i.e. this->m_network->addScale().
You need the underlying formulas of scale and batchnorm to understand the conversion; I won't write them out, they are around minute 20 of video 02.
*/
// e.g. .../bn1.weight.wgt
std::string weightsPath = this->rootPath + weightsName + ".weight.wgt";
std::string biasPath = this->rootPath + weightsName + ".bias.wgt";
std::string meanPath = this->rootPath + weightsName + ".running_mean.wgt";
std::string varPath = this->rootPath + weightsName + ".running_var.wgt";
std::vector<float> weights = this->loadWeoghts(weightsPath);
std::vector<float> bias = this->loadWeoghts(biasPath);
std::vector<float> mean = this->loadWeoghts(meanPath);
std::vector<float> var = this->loadWeoghts(varPath);
int size = bias.size(); // all four have the same length, any of them will do
std::vector<float> bn_var; // used several times, so computed once here
for (size_t i = 0; i < size; i++) {
bn_var.push_back(sqrt(var.at(i) + 1e-5)); // +1e-5 avoids a zero denominator later
}
float *shiftWt = new float[size]; // must be a raw array, it is handed to the Weights struct below
for (size_t i = 0; i < size; i++) {
// formula: shift = b - (mean * w) / sqrt(var + 1e-5); the bn_var vector could be dropped and inlined here
shiftWt[i] = bias[i] - ((mean.at(i) * weights.at(i)) / bn_var.at(i));
}
float *scaleWt = new float[size];
float *powerWt = new float[size];
for(size_t i = 0; i < size; i++) {
scaleWt[i] = weights.at(i) / bn_var.at(i); // formula explained above
powerWt[i] = 1.0;
}
nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, nullptr, size};
nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, nullptr, size};
nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, nullptr, size};
shift.values = shiftWt;
scale.values = scaleWt;
power.values = powerWt;
// batchnorm needs a channel mode, because we are using the scale API and only swapping in batchnorm's data
nvinfer1::ScaleMode scaleMode = nvinfer1::ScaleMode::kCHANNEL;
nvinfer1::IScaleLayer *batchNormal = this->m_network->addScale(*this->Layers[inputLayerName], scaleMode, shift, scale, power);
this->print_tensor_size("batchnormal", batchNormal->getOutput(0));
return batchNormal->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_activation(std::string inputLayerName, std::string activate_type) {
// there are many activation types; not all of them are handled here
nvinfer1::ActivationType ActivateType;
if (activate_type == "relu")
ActivateType = nvinfer1::ActivationType::kRELU; // jump into this enum, there are many more values
else if (activate_type == "sigmoid")
ActivateType = nvinfer1::ActivationType::kSIGMOID;
else if (activate_type == "tanh")
ActivateType = nvinfer1::ActivationType::kTANH;
else if (activate_type == "elu")
ActivateType = nvinfer1::ActivationType::kELU;
else if (activate_type == "l_relu")
ActivateType = nvinfer1::ActivationType::kLEAKY_RELU;
else if (activate_type == "clip")
ActivateType = nvinfer1::ActivationType::kCLIP;
nvinfer1::IActivationLayer *activate = this->m_network->addActivation(*this->Layers[inputLayerName], ActivateType);
// e.g. leaky_relu takes an alpha parameter, which has to be set here
if (activate_type == "l_relu") {
activate->setAlpha(0.001); // could be a class member passed in instead
}
if (activate_type == "clip") {
activate->setAlpha(0.1);
activate->setBeta(0.9); // arbitrary values; check the docs for what to use
}
this->print_tensor_size(activate_type, activate->getOutput(0));
return activate->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_pool(std::string inputLayerName, std::string pool_type, int kernel, int stride, int padding) {
nvinfer1::PoolingType PoolType;
if (pool_type == "max") {
PoolType = nvinfer1::PoolingType::kMAX;
}
else if (pool_type == "average") {
PoolType = nvinfer1::PoolingType::kAVERAGE;
}
nvinfer1::IPoolingLayer *pool = this->m_network->addPooling(*this->Layers[inputLayerName], PoolType, nvinfer1::DimsHW(kernel, kernel));
pool->setStride(nvinfer1::DimsHW(stride, stride));
pool->setPadding(nvinfer1::DimsHW(padding, padding));
this->print_tensor_size(pool_type + "pool", pool->getOutput(0));
return pool->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_calculate(std::string inputLayerName1, std::string inputLayerName2, std::string cal_type) {
/*
Adding two tensors is not a plain "+": it is its own element-wise layer, just like the batchnorm and convolution layers above.
*/
nvinfer1::ElementWiseOperation CalType;
if (cal_type == "add") {
CalType = nvinfer1::ElementWiseOperation::kSUM;
}
else if (cal_type == "divide") {
CalType = nvinfer1::ElementWiseOperation::kDIV;
}
else if (cal_type == "multiply") {
CalType = nvinfer1::ElementWiseOperation::kPROD; // element-wise product of the two tensors
}
// note the type below (all these layer types start with I)
nvinfer1::IElementWiseLayer *eltiswe = this->m_network->addElementWise(*this->Layers[inputLayerName1], *this->Layers[inputLayerName2], CalType);
this->print_tensor_size(cal_type, eltiswe->getOutput(0));
return eltiswe->getOutput(0);
}
// fc: fully connected
nvinfer1::ITensor* tensorRT::trt_fc(std::string inputLayerName, std::string weightsName, std::string biasName, int out_features) {
std::vector<float> weights = this->loadWeoghts(this->rootPath + weightsName);
std::vector<float> bias;
if (biasName != "") {
bias = this->loadWeoghts(this->rootPath + biasName);
}
unsigned int size = weights.size();
float *fc_weights = new float[size];
for (int i = 0; i < size; i++) {
fc_weights[i] = weights.at(i);
}
float *fc_bias = new float[out_features]; // note: sized by out_features
for (int i = 0; i < out_features; i++) { // note: i < out_features, not size
fc_bias[i] = 0.0; // initialize fc_bias
if (bias.size() != 0) {
fc_bias[i] = bias.at(i);
}
}
nvinfer1::Weights fc_wt{nvinfer1::DataType::kFLOAT, nullptr, size};
nvinfer1::Weights fc_bs{nvinfer1::DataType::kFLOAT, nullptr, out_features};
fc_wt.values = fc_weights;
fc_bs.values = fc_bias;
// fc: fully connected layer
nvinfer1::IFullyConnectedLayer *fc = this->m_network->addFullyConnected(*this->Layers[inputLayerName], out_features, fc_wt, fc_bs);
return fc->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_matmul(std::string inputLayerName1, std::string inputLayerName2) {
nvinfer1::MatrixOperation dtype = nvinfer1::MatrixOperation::kNONE; // kNONE means no transpose; usually the matrices are prepared before multiplying
nvinfer1::IMatrixMultiplyLayer *matmul = this->m_network->addMatrixMultiply(*this->Layers[inputLayerName1], dtype, *Layers[inputLayerName2], dtype);
return matmul->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_softmax(std::string inputLayerName, int dim) {
nvinfer1::ISoftMaxLayer *softmax = this->m_network->addSoftMax(*this->Layers[inputLayerName]);
return softmax->getOutput(dim); // see the header (.h) for what dim means
}
nvinfer1::ITensor* tensorRT::trt_concate(std::vector<std::string> inputLayerNames, int axis) {
int nbinputs = inputLayerNames.size();
// new an array and copy the tensors into it
nvinfer1::ITensor* *inputs = new nvinfer1::ITensor* [nbinputs];
for (int i = 0; i < nbinputs; ++i) {
inputs[i] = this->Layers[inputLayerNames.at(i)];
}
nvinfer1::IConcatenationLayer *concate = this->m_network->addConcatenation(inputs, nbinputs); // nbinputs is the length of the inputs array
concate->setAxis(axis); // which dimension to concatenate along
return concate->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_slice(std::string inputLayerName, std::vector<int> start, std::vector<int> outputSize, std::vector<int> step) {
nvinfer1::Dims start_dim = nvinfer1::Dims3(start[0], start[1], start[2]);           // use Dims3; brace-initializing a raw Dims would set nbDims to the first value
nvinfer1::Dims output_dim = nvinfer1::Dims3(outputSize[0], outputSize[1], outputSize[2]);
nvinfer1::Dims step_dim = nvinfer1::Dims3(step[0], step[1], step[2]);
nvinfer1::ISliceLayer *slice = this->m_network->addSlice(*this->Layers[inputLayerName], start_dim, output_dim, step_dim);
return slice->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_shuffle(std::string inputLayerName, std::vector<int> reshapeSize, std::vector<int> permuteSize) {
// reshapeSize is a shape vector such as {3, 128, 128}; it may also be 4-dimensional
// this part is for the reshape
int size = reshapeSize.size();
this->m_shuffle.reshape.nbDims = size;
for (int i = 0; i < size; ++i) {
this->m_shuffle.reshape.d[i] = reshapeSize.at(i);
}
// this part is for the permute
size = permuteSize.size();
for (int i = 0; i < size; ++i) {
this->m_shuffle.permute.order[i] = permuteSize.at(i);
}
nvinfer1::IShuffleLayer *shuffle = this->m_network->addShuffle(*Layers[inputLayerName]);
// these three flags are all set to true here only for illustration; in practice pick one of the cases
bool only_reshape = true, only_permute = true, both = true;
if (only_reshape)
shuffle->setReshapeDimensions(this->m_shuffle.reshape);
if (only_permute)
shuffle->setFirstTranspose(m_shuffle.permute);
if (both) {
// when doing both, you must decide whether reshape or transpose comes first
bool reshape_first = true;
if (reshape_first) {
shuffle->setReshapeDimensions(m_shuffle.reshape);
shuffle->setSecondTranspose(m_shuffle.permute);
}
else {
shuffle->setFirstTranspose(m_shuffle.permute);
shuffle->setReshapeDimensions(m_shuffle.reshape);
}
}
return shuffle->getOutput(0);
}
nvinfer1::ITensor* tensorRT::trt_constant(std::vector<int> dimensions, float alpha) {
int all = 1;
nvinfer1::Dims Dims;
Dims.nbDims = dimensions.size();
for (int i = 0; i < dimensions.size(); ++i) {
all *= dimensions.at(i); // total number of elements (the original had all *= i, which is wrong: it stays 0 after the first iteration)
Dims.d[i] = dimensions.at(i);
}
nvinfer1::Weights weights{nvinfer1::DataType::kFLOAT, nullptr, all};
float *val = new float[all];
for (int i = 0; i < all; ++i) {
val[i] = alpha;
}
weights.values = val;
nvinfer1::IConstantLayer *constant = this->m_network->addConstant(Dims, weights);
return constant->getOutput(0);
}
#include <iostream>
#include <NvInfer.h>
#include <driver_types.h>      // needed for cudaError_t (this seems to be all it needs); also link cudart.lib
#include <cuda_runtime_api.h>  // needed for cudaGetDeviceCount; also needs cudart.lib
#include "tensorrt.h"
int main() {
int cudaNum = 0;
cudaError_t error = cudaGetDeviceCount(&cudaNum);
if (cudaSuccess != error) return 0;
if (cudaNum <= 0) return 0;
int idx = 0;
if (cudaNum > 1) {
std::cout << "please choose the GPU idnex: " << std::endl;
std::cin >> idx;
if (idx >= cudaNum)
idx = cudaNum - 1;
else if (idx < 0)
idx = 0;
}
cudaSetDevice(idx);
cudaFree(nullptr);
// build the .engine
tensorRT *trt = new tensorRT();
trt->createENG("E:/project/Pycharm_project/trt_study/resnet18.engine");
std::cout << "Hello World!" << std::endl;
return 0;
}
Note: following the steps above, the .engine file can definitely be built; it compiles and runs to completion.
-
First declare the inference-related functions and members in the tensorRT class in tensorrt.h:

class tensorRT {
public:
    /*.........*/
    // the inference part starts here
    void Inference_init(const std::string &engPath, int batchsize);
    void doInference(const float *input, int batchsize, float *output);
    nvinfer1::ICudaEngine *engine;    // kept here so it can be released (it may not actually need releasing)
    int inputSize = 3 * 256 * 256;    // the image size defined earlier; batchsize not handled yet
    int outputSize = 1000;            // 1000*1*1
    int inputIdx, outputIdx;
    std::vector<void *> m_bindings;   // all inputs and outputs are said to live in here
    nvinfer1::IExecutionContext *m_context;  // the execution context, used throughout
    cudaStream_t m_cudaStream;
};
-
The implementation of those functions (this is the really important part), in tensorrt.cpp:

void tensorRT::Inference_init(const std::string &engPath, int batchsize) {
    // just reading a binary file
    std::ifstream cache(engPath, std::ios::binary);
    cache.seekg(0, std::ifstream::end);   // move the stream to the end (opening with std::ios::ate | std::ios::binary puts the pointer at the end directly)
    const int engSize = cache.tellg();    // after seeking to the end, tellg() gives the position, i.e. the size
    // std::ifstream::pos_type mark = cache.tellg();  // (int)mark equals engSize
    // now that the size is known, move back to the beginning of the stream
    cache.seekg(0, std::ios::beg);        // you also see cache.beg / cache.end written; same thing
    void *modelMem = malloc(engSize);
    cache.read((char *)modelMem, engSize);  // print engSize / mark / sizeof() to compare if curious
    cache.close();
    // building used an IBuilder; inference needs an IRuntime
    nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(this->m_logger);
    // deserialize; there is no custom plugin layer, so the third argument is nullptr
    this->engine = runtime->deserializeCudaEngine(modelMem, engSize, nullptr);
    // the engine is now deserialized onto the GPU, so these can be released
    runtime->destroy();
    free(modelMem);
    if (!engine) return;
    // after deserializing, allocate the input/output buffers
    this->m_context = engine->createExecutionContext();
    // the stream also has to be created; the error check is required, it does not work without it
    if (cudaStreamCreate(&this->m_cudaStream) != 0) return;
    int bindings = engine->getNbBindings();
    this->m_bindings.resize(bindings, nullptr);  // initialize the vector
    this->inputIdx = engine->getBindingIndex("data");  // the "data" name given when the engine was built
    // cudaMalloc needs <cuda_runtime_api.h>
    int flag = cudaMalloc(&this->m_bindings.at(inputIdx), batchsize * this->inputSize * sizeof(float));  // note the allocation size
    if (flag != 0) { std::cout << "malloc error!" << std::endl; return; }
    this->outputIdx = engine->getBindingIndex("output");  // the output was marked "output" when creating the .engine file
    flag = cudaMalloc(&this->m_bindings.at(outputIdx), batchsize * this->outputSize * sizeof(float));
    if (flag != 0) { std::cout << "malloc error!" << std::endl; return; }
}

void tensorRT::doInference(const float *input, int batchsize, float *output) {
    int flag;
    // 1.0 copy the input to the location given by m_bindings; cudaMemcpyHostToDevice means host memory -> GPU memory; the last argument is always needed
    flag = cudaMemcpyAsync(this->m_bindings.at(this->inputIdx), input, batchsize*this->inputSize*sizeof(float), cudaMemcpyHostToDevice, this->m_cudaStream);
    if (flag != 0) { std::cout << "input copy to cuda error!" << std::endl; return; }
    // 2.0 run inference with the context; results go to the locations in m_bindings
    // a_vector.data() gives the address of the first element, same as &(*a_vec.begin())
    this->m_context->enqueue(batchsize, this->m_bindings.data(), this->m_cudaStream, nullptr);
    // 3.0 copy the result back from GPU memory to host memory
    flag = cudaMemcpyAsync(output, this->m_bindings.at(this->outputIdx), batchsize*this->outputSize*sizeof(float), cudaMemcpyDeviceToHost, this->m_cudaStream);
    if (flag != 0) { std::cout << "output copy to mem error!" << std::endl; return; }
    cudaStreamSynchronize(this->m_cudaStream);  // wait for the stream to finish
}

// release the resources in the destructor
tensorRT::~tensorRT() {
    if (this->m_context) { m_context->destroy(); m_context = nullptr; }
    if (this->engine)    { engine->destroy();    engine = nullptr; }
    for (auto bindings : this->m_bindings) {
        cudaFree(bindings);
    }
}
-
In main.cpp (OpenCV is needed; remember to add its .dll path to the environment variables and set the include/library paths in the .pro file accordingly):

/*...*/
#include <opencv2/core/core.hpp>
#include <opencv2/dnn/dnn.hpp>
#include <opencv2/imgcodecs/imgcodecs.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main() {
    /*.....*/
    tensorRT *trt = new tensorRT();
    // generating the engine once is enough
    // trt->createENG("E:/project/Pycharm_project/trt_study/resnet18.engine");
    trt->Inference_init("E:/project/Pycharm_project/trt_study/resnet18.engine", 10);
    // feed in a single image
    cv::Mat image = cv::imread("E:/project/Pycharm_project/trt_study/1.jpg");
    cv::Mat blob = cv::dnn::blobFromImage(image, 1.0, cv::Size(256, 256), cv::Scalar(127.0, 127.0, 127.0), true, false);
    float *input = new float[1*3*256*256];   // one image as input
    memcpy(input, blob.data, 1*3*256*256*sizeof(float));
    float *output = new float[1*1000*1*1];
    trt->doInference(input, 1, output);
    for (int i = 0; i < 1000; i++) {
        std::cout << i << ": " << output[i] << std::endl;
    }
    /*
      Comparing this output with the Python network output, the results are almost identical. Python code:
        model = torchvision.models.resnet18(pretrained=False)
        model.load_state_dict(torch.load("./resnet18.pth"))
        model.cuda()
        model.eval()
        image = cv2.imread("./1.jpg")
        blob = cv2.dnn.blobFromImage(image, 1.0, (256, 256), (127.0, 127.0, 127.0), True, False)
        input_data = torch.Tensor(blob).cuda()
        output = model(input_data)
        print(output)
    */
    return 0;
}
In 3.3.1 the network produces a single output, essentially one class vector; often you also need things like target coordinates, i.e. more than one output. Building on 3.3.1, only a few functions need changing.
-
Add the following lines to void tensorRT::createENG(std::string engPath) and regenerate the .engine file:

// one extra output, output1 (arbitrary here: the fc output passed through a relu is used as the new output)
this->Layers["relu_eng"] = this->trt_activation("fc", "relu");
this->Layers["relu_eng"]->setName("output1");  // note: the name must differ from the one above
this->m_network->markOutput(*this->Layers["relu_eng"]);
-
Add some members and a function to tensorrt.h:

class tensorRT {
public:
    /*.......................*/
    // two outputs below
    int outputs[2] = {1000, 1000};   // change this if the output sizes differ
    std::vector<int> outputIndexs;
    int alloutputsize = 2000;        // total of all outputs (1000+1000), handy for allocating one big buffer
    void *temp;                      // temporary buffer
    // two (extendable to more) outputs
    void doInferences_two(const float *input, int batchsize, float *output);
};
-
For inference, the engine initialization has to change:

void tensorRT::Inference_init(const std::string &engPath, int batchsize) {
    /*..................*/
    /* single-output version:
    this->outputIdx = engine->getBindingIndex("output");  // the output was marked "output" when creating the .engine file
    flag = cudaMalloc(&this->m_bindings.at(outputIdx), batchsize * this->outputSize * sizeof(float));
    if (flag != 0) { std::cout << "malloc error!" << std::endl; return; }
    */
    // two outputs, so allocate two buffers
    this->outputIndexs.push_back(engine->getBindingIndex("output"));
    this->outputIndexs.push_back(engine->getBindingIndex("output1"));
    for (int i = 0; i < this->outputIndexs.size(); i++) {
        cudaMalloc(&this->m_bindings.at(this->outputIndexs.at(i)), batchsize * this->outputSize * sizeof(float));
    }
    // this line is required: allocate one buffer for all outputs combined; the per-output allocations above are still needed
    cudaMalloc(&this->temp, batchsize*this->alloutputsize*sizeof(float));
}
-
Implementation of void doInferences_two(const float *input, int batchsize, float *output):

void tensorRT::doInferences_two(const float *input, int batchsize, float *output) {
    int flag;
    // copy the input to the location given by m_bindings; cudaMemcpyHostToDevice means host -> device; the last argument is always needed
    flag = cudaMemcpyAsync(this->m_bindings.at(this->inputIdx), input, batchsize*this->inputSize*sizeof(float), cudaMemcpyHostToDevice, this->m_cudaStream);
    if (flag != 0) { std::cout << "input copy to cuda error!" << std::endl; return; }
    // run inference with the context; results go to the locations in m_bindings
    // a_vector.data() gives the address of the first element, same as &(*a_vec.begin())
    this->m_context->enqueue(batchsize, this->m_bindings.data(), this->m_cudaStream, nullptr);
    /**** everything above is the same as the single-output version *************/
    // with two outputs we cannot copy straight back to the host; gather them into the temporary buffer this->temp first
    int outNum = 0;
    int allNum = this->m_bindings.size();  // this holds the input plus all outputs
    // start at 1, because [0] is the input "data"
    for (int i = 1; i < allNum; i++) {
        // note: still DeviceToDevice, everything happens in GPU memory
        cudaMemcpyAsync((float*)this->temp + batchsize*outNum, this->m_bindings.at(this->outputIndexs[i-1]),
                        batchsize*this->outputs[i-1]*sizeof(float), cudaMemcpyDeviceToDevice, this->m_cudaStream);
        outNum += this->outputs[i-1];
    }
    flag = cudaMemcpyAsync(output, this->temp, batchsize*outNum*sizeof(float), cudaMemcpyDeviceToHost, this->m_cudaStream);
    if (flag != 0) { std::cout << "output copy to mem error!" << std::endl; return; }
    cudaStreamSynchronize(this->m_cudaStream);
}
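A usage sketch of my own (assuming batchsize 1 and the two 1000-float outputs defined above): the host buffer handed to doInferences_two simply holds the outputs back to back, so they are read like this:

float *input = new float[1 * 3 * 256 * 256];   // filled from a blob as in main.cpp above
float *output = new float[1 * 2000];           // alloutputsize floats
trt->doInferences_two(input, 1, output);
const float *out0 = output;                    // the 1000 values of "output"
const float *out1 = output + 1000;             // the 1000 values of "output1"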
There are two approaches:
- when building the TRT network, compute the calibration scales directly from the original network weights, without generating a calibration table (the tutorial says it never used this)
- in TensorRT, inherit from and override the class nvinfer1::IInt8Calibrator (get the class name exactly right) to generate an INT8 calibration table (this is the one used)
I won't write much about the theory; the code is fairly boilerplate and can be dropped into the project above as-is.
-
calibrator.h:
#ifndef CALIBRATOR_H
#define CALIBRATOR_H
#include <NvInfer.h>
#include <string>
#include <vector>

class Calibrator : public nvinfer1::IInt8EntropyCalibrator {
public:
    Calibrator(const unsigned int &batchsize, const std::string &caliTxt, const std::string &calibratorPath,
               const uint64_t &inputSize, const unsigned int &inputH, const unsigned int &inputW,
               const std::string &inputName);
    int getBatchSize() const override;
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override;
    const void* readCalibrationCache(size_t &length) override;
    void writeCalibrationCache(const void* ptr, std::size_t length) override;
private:
    unsigned int m_batchsize;
    const unsigned int m_inputH;
    const unsigned int m_inputW;
    const uint64_t m_inputSize;
    const uint64_t m_inputCount;
    const char* m_inputName;
    const std::string m_calibratorPath{nullptr};
    std::vector<std::string> m_ImageList;
    void *m_cudaInput{nullptr};
    std::vector<char> m_calibrationCache;
    unsigned int m_ImageIndex;
};
#endif // CALIBRATOR_H
-
calibrator.cpp:
#include "calibrator.h" #include <fstream> #include <iostream> #include <cuda_runtime_api.h> #include <opencv2/opencv.hpp> // 把存有每张图片的txt加载,得到一个vector // imgTxt是一个txt文本路径,里面放的校准图片的路径,示例在下面的“注”的第一点 std::vector<std::string> loadImage(const std::string &imgTxt) { std::vector<std::string> imgInfo; FILE *f = fopen(imgTxt.c_str(), "r"); if (!f) { perror("Error"); std::cout << "cant open file" << std::endl; return imgInfo; } char str[512]; while (fgets(str, 512, f) != NULL) { for (int i = 0; str[i] != '\0'; ++i) { if (str[i] == '\r') {str[i] = '\0';} if (str[i] == '\n') {str[i] = '\0'; break;} } imgInfo.push_back(str); } fclose(f); return imgInfo; } Calibrator::Calibrator(const unsigned int &batchsize, const std::string &caliTxt, const std::string &calibratorPath, const uint64_t &inputSize, const unsigned int &inputH, const unsigned int &inputW, const std::string &inputName) : m_batchsize(batchsize), m_inputH(inputH), m_inputW(inputW), m_inputSize(inputSize), m_inputCount(batchsize * inputSize), m_inputName(inputName.c_str()), m_calibratorPath(calibratorPath), m_ImageIndex(0) { this->m_ImageList = loadImage(caliTxt); cudaMalloc(&this->m_cudaInput, this->m_inputCount * sizeof (float)); } int Calibrator::getBatchSize() const { return this->m_batchsize; } bool Calibrator::getBatch(void **bindings, const char **names, int nbBindings) { if (this->m_ImageIndex + this->m_batchsize > this->m_ImageList.size()) return false; std::cout << this->m_batchsize <<std::endl; std::vector<cv::Mat> inputImages; for (unsigned int i = this->m_ImageIndex; i < m_ImageIndex+this->m_batchsize; i++) { std::string imgPath = this->m_ImageList.at(i); std::cout << imgPath << std::endl; cv::Mat temp = cv::imread(imgPath); if (temp.empty()) { std::cout << "img read error!" << std::endl; } inputImages.push_back(temp); } this->m_ImageIndex += this->m_batchsize; cv::Mat trtInput = cv::dnn::blobFromImages(inputImages, 1.0, cv::Size(m_inputH, m_inputW), cv::Scalar(127.0, 127.0, 127.0), true, false); cudaMemcpy(m_cudaInput, trtInput.ptr<float>(0), m_inputCount*sizeof (float), cudaMemcpyHostToDevice); bindings[0] = m_cudaInput; return true; } const void* Calibrator::readCalibrationCache(size_t &length) { // 如果有校准表就读取拿到,没有就返回一个空的指针(else中),后续去创建 void *output; this->m_calibrationCache.clear(); std::ifstream input(this->m_calibratorPath, std::ios::binary); input >> std::noskipws; if (input.good()) { std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(this->m_calibrationCache)); } length = this->m_calibrationCache.size(); // 修改了传入的参数 if (length) { std::cout << "using cached calibration table to build the engine" << std::endl; output = &this->m_calibrationCache.at(0); } else { std::cout << "New calibration table will be created to build the engine" << std::endl; output = nullptr; } return output; } void Calibrator::writeCalibrationCache(const void *ptr, std::size_t length) { // ptr说是tensorrt中自己会去计算,因为这里有继承嘛 assert(!this->m_calibratorPath.empty()); std::cout << "length = " << length << std::endl; std::ofstream output(this->m_calibratorPath, std::ios::binary); output.write(reinterpret_cast<const char*>(ptr), length); output.close(); }
-
Include this header in tensorrt.cpp and, when creating the .engine file, decide whether to use INT8:

void tensorRT::createENG(std::string engPath) {
    /*.....................................*/
    // whether to use int8
    this->isInt8 = true;  // not initialized in the constructor, so set it by hand here
    if (this->isInt8) {
        const std::string caliTxt = "E:/project/Pycharm_project/trt_study/int8_pic/calibration.txt";
        const std::string int8cali_table = "E:/project/Pycharm_project/trt_study/int8_pic/int8cal.table";
        Calibrator *m_calbrator = new Calibrator(1, caliTxt, int8cali_table, 3*256*256, 256, 256, "data");  // "data" was fixed above
        builder->setInt8Mode(true);
        builder->setInt8Calibrator(m_calbrator);
    }
}
Note:
-
calibration.txt: written by hand, with the format below; calibration is normally done with a few thousand images from your own dataset.

E:/project/Pycharm_project/trt_study/int8_pic/1.jpg
E:/project/Pycharm_project/trt_study/int8_pic/4.jpg
E:/project/Pycharm_project/trt_study/int8_pic/5.jpg
E:/project/Pycharm_project/trt_study/int8_pic/6.jpg
-
int8cal.table: the calibration table generated by the program on the first run (it can be opened with a text editor).
-
-
Finally, call the engine-creation function in main to build the INT8 .engine file.
This is for operators that TensorRT does not provide. Try not to write your own: a custom layer may actually make things slower (because ours is not written as well); prefer reworking existing layers and their variables.
- Add the libs: -L'E:\lib\TensorRT-7.2.3.4.Windows10.x86_64.cuda-10.2.cudnn8.1\lib' nvinfer.lib nvinfer_plugin.lib  # nvinfer_plugin.lib is the one that matters
Just leaving this here: the code has problems and doesn't run. m_pluginfactory.h in particular is broken; the code follows the video, but there are obvious errors (the inheritance goes too deep for me to untangle). leaky-relu (l_relu) is used as the example.
Keeping it here anyway, so I can copy it if I ever come back to this video:
-
trt_demo.pro
TEMPLATE = app
CONFIG += console c++11
CONFIG -= app_bundle
CONFIG -= qt

win32 {
INCLUDEPATH += \
    'E:\lib\TensorRT-7.2.3.4.Windows10.x86_64.cuda-10.2.cudnn8.1\include' \
    'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include' \
    'E:\lib\opencv\build\include'
}

win32 {
LIBS += \
    -L'E:\lib\TensorRT-7.2.3.4.Windows10.x86_64.cuda-10.2.cudnn8.1\lib' nvinfer.lib nvinfer_plugin.lib \
    -L'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64' cudart.lib \
    -L'E:\lib\opencv\build\x64\vc15\lib' opencv_world440d.lib
}

SOURCES += \
    main.cpp \
    tensorrt.cpp \
    calibrator.cpp \
    m_lrelu.cpp

HEADERS += \
    tensorrt.h \
    calibrator.h \
    m_lrelu.h \
    m_pluginfactory.h

CUDA_SOURCES += \
    m_lrelu.cu    # the GPU-side code goes in this file

# Qt needs the following to compile .cu files
win32 {
    SYSTEM_NAME = x64
    SYSTEM_TYPE = 64
    CUDA_ARCH = compute_35
    CUDA_CODE = sm_35    # set these according to the GPU model
    CUDA_INC = $$join(INCLUDEPATH, '" -I"','-I"','"')
    MSVCRT_LINK_FLAG_DEBUG = "/MDd"
    MSVCRT_LINK_FLAG_RELEASE = "/MD"
    # Configuration of the Cuda compiler
    CONFIG(debug, debug|release) {
        # Debug mode
        cuda.input = CUDA_SOURCES
        cuda.output = $$OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.obj
        cuda.commands = C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2/bin/nvcc.exe -D_DEBUG -Xcompiler $$MSVCRT_LINK_FLAG_DEBUG -c -Xcompiler $$join(QMAKE_CXXFLAGS,",") $$join(INCLUDEPATH,'" -I "', '-I "', '"') ${QMAKE_FILE_NAME} -o ${QMAKE_FILE_OUT}
    }
    else {
        # Release mode
        cuda.input = CUDA_SOURCES
        cuda.output = $$OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.obj
        cuda.commands = C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2/bin/nvcc.exe -Xcompiler $$MSVCRT_LINK_FLAG_RELEASE -c -Xcompiler $$join(QMAKE_CXXFLAGS,",") $$join(INCLUDEPATH,'" -I "', '-I "', '"') ${QMAKE_FILE_NAME} -o ${QMAKE_FILE_OUT}
    }
}
-
m_lrelu.h
#ifndef M_LRELU_H
#define M_LRELU_H
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <iostream>
#include <assert.h>

namespace nvinfer1 {
// the plugin has to inherit from this class
class m_Lrelu : public nvinfer1::IPluginExt {
public:
    explicit m_Lrelu(const float alpha, const int cudaThread, DataType type);
    m_Lrelu(const void* buffer, size_t size);
    ~m_Lrelu() override;
    int getNbOutputs() const override;
    Dims getOutputDimensions(int index, const Dims *inputs, int nbInputDims) override;
    bool supportsFormat(DataType type, PluginFormat format) const override;
    void configureWithFormat(const Dims *inputDims, int nbInputs, const Dims *outputDims, int nbOutputs,
                             DataType type, PluginFormat format, int maxBatchSize) override;
    int initialize() override;
    size_t getWorkspaceSize(int maxBatchSize) const override;
    // during inference this is the function that gets called automatically
    int enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream) override;
    size_t getSerializationSize() override;
    void serialize(void* buffer) override;
    void terminate() override;
    void lReluForward(const int n, const float *input, float *output, const float alpha);
private:
    float m_alpha;
    int m_ThreadCount;
    nvinfer1::Dims m_CHW;
    int m_C;
    int m_H;
    int m_W;
    int m_inputSize;
    DataType m_dataType;
};
}
#endif // M_LRELU_H
-
m_lrelu.cpp
#include "m_lrelu.h" namespace nvinfer1 { template<typename T> void read(const char* &buffer, T &val) { val = *reinterpret_cast<const T*>(buffer); buffer += sizeof(T); } template<typename T> void write(char* &buffer, const T &val) { *reinterpret_cast<T*>(buffer) = val; buffer += sizeof(T); } m_Lrelu::m_Lrelu(const float alpha, const int cudaThread, DataType type) : m_alpha(alpha), m_ThreadCount(cudaThread), m_dataType(type) { } m_Lrelu::m_Lrelu(const void* buffer, size_t size) { const char *d = reinterpret_cast<const char*>(buffer), *a = d; read(d, m_alpha); read(d, m_CHW); read(d, m_C); read(d, m_H); read(d, m_W); read(d, m_inputSize); read(d, m_dataType); read(d, m_ThreadCount); assert(d == a + size); } m_Lrelu::~m_Lrelu() {} int m_Lrelu::getNbOutputs() const { return 1; } Dims m_Lrelu::getOutputDimensions(int index, const Dims *inputs, int nbInputDims) { this->m_CHW = inputs[0]; // 相当于拿的第一个数据,n是1 this->m_C = m_CHW.d[0]; this->m_H = m_CHW.d[1]; this->m_W = m_CHW.d[2]; this->m_inputSize = m_C * m_H * m_W; return Dims3(m_C, m_H, m_W); } bool m_Lrelu::supportsFormat(DataType type, PluginFormat format) const { return (type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8) && format == PluginFormat::kNCHW; } void m_Lrelu::configureWithFormat(const Dims *inputDims, int nbInputs, const Dims *outputDims, int nbOutputs, DataType type, PluginFormat format, int maxBatchSize) { assert((type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8) && format == PluginFormat::kNCHW); } // 继承的虚函数没用到,就把重载写这里,但其实没有任何功能实现 int m_Lrelu::initialize() {return 0;} // 在getOutputDimensions函数里已经写了,也可以把那里面的初始化代码放这里 void m_Lrelu::terminate() {} size_t m_Lrelu::getWorkspaceSize(int maxBatchSize) const {return 0;} size_t m_Lrelu::getSerializationSize() { return sizeof(m_alpha) + sizeof(m_CHW) + sizeof(m_C) + sizeof(m_H) + sizeof(m_W) + sizeof(m_inputSize) + sizeof(m_dataType) + sizeof(m_ThreadCount); } void m_Lrelu::serialize(void *buffer) { char *d = static_cast<char*>(buffer), *a = d; write(d, m_alpha); write(d, m_CHW); write(d, m_C); write(d, m_H); write(d, m_W); write(d, m_inputSize); write(d, m_dataType); write(d, m_ThreadCount); assert(d == a + this->getSerializationSize()); } // 这是要调用GPU的 int m_Lrelu::enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) { const int count = batchSize * m_inputSize; const float *input_data = reinterpret_cast<const float*>(inputs[0]); float *output_data = reinterpret_cast<float*>(outputs[0]); this->lReluForward(count, input_data, output_data, this->m_alpha); // 这应该是父类中的函数 return 0; }
-
m_pluginfactory.h  # the error means the compiler cannot find a matching base-class createPlugin to override; it treats this as a new declaration, so the override specifier is reported as wrong

#ifndef M_PLUGINFACTORY_H
#define M_PLUGINFACTORY_H
#include <NvInfer.h>
#include <NvInferPlugin.h>
#include "m_lrelu.h"
#include <memory>
#include <vector>
#include <iostream>

using namespace std;
using nvinfer1::plugin::INvPlugin;
using nvinfer1::m_Lrelu;

class m_pluginFactory : public nvinfer1::IPluginFactory {
    // the nvinfer1 namespace here is the one defined in our own header
    nvinfer1::m_Lrelu* createPlugin(const char* layerName, const void* serialData, size_t serialLength) override {
        m_Lrelu_Layers.emplace_back(std::unique_ptr<nvinfer1::m_Lrelu>(new nvinfer1::m_Lrelu(serialData, serialLength)));
        return m_Lrelu_Layers.back().get();
    }
    void destroyPlugin() {
        for (auto &item : m_Lrelu_Layers) {
            item.reset();
        }
    }
    std::vector<std::unique_ptr<nvinfer1::m_Lrelu> > m_Lrelu_Layers{};
};
#endif // M_PLUGINFACTORY_H
-
Then implement the leaky-relu layer in tensorrt.cpp:

#include "m_lrelu.h"
#include "m_pluginfactory.h"
/*.....*/
// implementation of the leak_relu layer
nvinfer1::ITensor* tensorRT::trt_Lrelu(std::string inputLayerName, const float alpha) {
    nvinfer1::DataType dtype = nvinfer1::DataType::kFLOAT;  // reportedly it converts by itself when int8 is used
    nvinfer1::IPluginExt *lrelu = new nvinfer1::m_Lrelu(alpha, 512, dtype);  // this nvinfer1 is the namespace from our own header
    // when adding a plugin layer note the type and function name; the first argument takes an address, unlike the dereferencing used above
    nvinfer1::IPluginLayer *m_lrelu = this->m_network->addPluginExt(&this->Layers[inputLayerName], 1, *lrelu);
    return m_lrelu->getOutput(0);
}
Then add to tensorRT::Inference_init:

// deserialize; without a custom plugin layer the third argument was nullptr:
// this->engine = runtime->deserializeCudaEngine(modelMem, engSize, nullptr);
// with our own plugin layer, pass the factory as the third argument instead
nvinfer1::IPluginFactory *m_plugin = new m_pluginFactory();
this->engine = runtime->deserializeCudaEngine(modelMem, engSize, m_plugin);
-
The code in m_lrelu.cu:

#include "m_lrelu.h"

#define CUDA_KERNEL_LOOP(i,n) for(size_t i = blockIdx.x*blockDim.x + threadIdx.x; i < (n); i += blockDim.x*gridDim.x)

namespace nvinfer1 {

__global__ void lRelu(const int n, const float *input, float *output, const float alpha) {
    CUDA_KERNEL_LOOP(index, n) {
        // the leak_relu computation itself
        output[index] = input[index] > 0 ? input[index] : input[index] * alpha;
    }
}

void m_Lrelu::lReluForward(const int n, const float *input, float *output, const float alpha) {
    // the grid-stride loop above keeps all launched threads busy
    lRelu<<<(n + m_ThreadCount - 1) / m_ThreadCount, m_ThreadCount>>>(n, input, output, alpha);
}

}
-
Finally, rebuild in main.cpp to regenerate the .engine and use it.
The chances of needing this are probably small. I'll follow along and write the TensorRT version of yolov5; the videos, files, code and models from this study go on Aliyun Drive, as a reference in case they are ever needed.
You can load the yolov5 model locally in PyCharm (it needs the "models" and "utils" modules from the yolov5 source), then debug "gen_wts.py"; that shows the whole structure very clearly and makes writing model.cpp much easier.
The main thing is a few APIs that changed across versions; I'm putting them here:
Mainly in model.cpp:
/*
To understand yolov5's structure (which makes the network-building code easier to follow), see this post:
https://blog.csdn.net/wq_0708/article/details/121472274
==Difference between addConvolutionNd and addConvolution== (ChatGPT's answer; the pooling functions also have Nd variants):
addConvolutionNd supports convolutions of arbitrary dimension, while addConvolution only supports 2D convolution.
addConvolutionNd can set richer convolution parameters such as kernel size, stride and padding, while addConvolution only sets kernel size, stride and padding.
addConvolutionNd supports more options such as dilation, groups and bias, while addConvolution only supports the bias option.
So use addConvolutionNd for multi-dimensional convolutions or richer parameters and options; plain addConvolution is enough for simple 2D convolutions.
==createNetwork vs createNetworkV2== in TensorRT: both create a network, the differences being:
createNetwork is the function used up to TensorRT 5; createNetworkV2 is the new function from TensorRT 6 on.
createNetwork creates only one network, while createNetworkV2 can create several, which better supports sharing layers between networks.
createNetworkV2 exposes more network options, e.g. the maximum batch size and whether dynamic batch sizes are supported.
createNetworkV2 has a cleaner, more consistent API design and naming.
So with TensorRT 5 and earlier use createNetwork; from TensorRT 6 on, createNetworkV2 is recommended because it supports more features and is more flexible.
*/
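A minimal sketch of my own (assuming the TensorRT 7.x C++ API, not taken from the course code): in practice the main differences are the explicit-batch flag passed to createNetworkV2 and the Nd-style setters on the convolution layer; buildExplicitBatchNet and the 1x3x256x256 input shape here are just placeholders.

#include <NvInfer.h>

nvinfer1::INetworkDefinition* buildExplicitBatchNet(nvinfer1::IBuilder* builder,
                                                    const nvinfer1::Weights& wt,
                                                    const nvinfer1::Weights& bias) {
    // createNetworkV2 takes a flag bitmask; kEXPLICIT_BATCH puts the batch dimension into the tensor shape
    const auto explicitBatch =
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    // with explicit batch the input is 4-D: N x C x H x W
    nvinfer1::ITensor* data =
        network->addInput("data", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4{1, 3, 256, 256});

    // addConvolutionNd takes Dims for the kernel; stride and padding use the *Nd setters
    nvinfer1::IConvolutionLayer* conv =
        network->addConvolutionNd(*data, 64, nvinfer1::DimsHW{7, 7}, wt, bias);
    conv->setStrideNd(nvinfer1::DimsHW{2, 2});
    conv->setPaddingNd(nvinfer1::DimsHW{3, 3});
    return network;
}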