
KuiperInfer Lesson 8: Implementing ResNet Inference

The Model Execution Function

Back in Lesson 3 we studied RuntimeGraph, the runtime computation graph. It is essentially a wrapper around pnnx::Graph that adapts it to the model-inference workflow.

/// The computation graph: a set of computation nodes plus the data flow between them
class RuntimeGraph {
 public:
  /// Construct the graph from param_path (structure file) and bin_path (weight file)
  RuntimeGraph(std::string param_path, std::string bin_path);

  /// Set the weight file path
  void set_bin_path(const std::string &bin_path);

  /// Set the structure file path
  void set_param_path(const std::string &param_path);

  /// Return the structure file path
  const std::string &param_path() const;

  /// Return the weight file path
  const std::string &bin_path() const;

  /// Initialize the computation graph; returns whether initialization succeeded
  bool Init();

  const std::vector<std::shared_ptr<RuntimeOperator>> &operators() const;

  /// Build the computation graph, given the names of its input and output nodes
  void Build(const std::string &input_name, const std::string &output_name);

  const std::vector<std::shared_ptr<RuntimeOperator>> &get_topo_queues() const;

  /// Create the Layer corresponding to a computation node
  static std::shared_ptr<Layer> CreateLayer(const std::shared_ptr<RuntimeOperator> &op);

  std::vector<std::shared_ptr<Tensor<float>>> Forward(
      const std::vector<std::shared_ptr<Tensor<float>>> &inputs, bool debug);

 private:
  /// Initialize a graph node's input operands from the pnnx input operands
  static void InitGraphOperatorsInput(
      const std::vector<pnnx::Operand *> &inputs,
      const std::shared_ptr<RuntimeOperator> &runtime_operator);

  /// Initialize a graph node's output operands from the pnnx output operands
  static void InitGraphOperatorsOutput(
      const std::vector<pnnx::Operand *> &outputs,
      const std::shared_ptr<RuntimeOperator> &runtime_operator);

  /// Initialize a graph node's attributes from the pnnx attributes
  static void InitGraphAttrs(
      const std::map<std::string, pnnx::Attribute> &attrs,
      const std::shared_ptr<RuntimeOperator> &runtime_operator);

  /// Initialize a graph node's parameters from the pnnx parameters
  static void InitGraphParams(
      const std::map<std::string, pnnx::Parameter> &params,
      const std::shared_ptr<RuntimeOperator> &runtime_operator);

  void ReverseTopo(const std::shared_ptr<RuntimeOperator> &root_op);

  /// Probe the next computation nodes: write the current node's output into
  /// the input tensors of its successors
  static void ProbeNextLayer(
      const std::shared_ptr<RuntimeOperator> &current_op,
      const std::vector<std::shared_ptr<Tensor<float>>> &layer_output_data);

 private:
  enum class GraphState {
    NeedInit = -2,
    NeedBuild = -1,
    Complete = 0,
  };

 public:
  /// Return the current state of the model
  GraphState graph_state() const;

 private:
  GraphState graph_state_ = GraphState::NeedInit;
  std::string input_name_;  /// Name of the graph's input node
  std::string output_name_; /// Name of the graph's output node
  std::string param_path_;  /// Structure file of the graph
  std::string bin_path_;    /// Weight file of the graph
  std::vector<std::shared_ptr<RuntimeOperator>> operators_;
  std::map<std::string, std::shared_ptr<RuntimeOperator>> operators_maps_;
  std::vector<std::shared_ptr<RuntimeOperator>> topo_operators_;
  std::unique_ptr<pnnx::Graph> graph_; /// The pnnx graph
};

The construction of RuntimeGraph was covered earlier, so at this point we already have the complete execution order of the graph. This lesson focuses on the graph's forward pass, that is, the RuntimeGraph::Forward function:

std::vector<std::shared_ptr<Tensor<float>>> RuntimeGraph::Forward(const std::vector<std::shared_ptr<Tensor<float>>>& inputs, bool debug) 

Forward takes the inputs and returns the outputs. Its core is to call each operator's Forward function in execution order and to wire up the data flow between operators: the output of one operator becomes the input of the next. An operator may have several inputs but only one output, and one operator's output may feed several downstream operators. The process is the following for loop:

for (const auto& current_op : topo_operators_) {
  if (current_op->type == "pnnx.Input") {
    current_op->has_forward = true;
    ProbeNextLayer(current_op, inputs);
  } else if (current_op->type == "pnnx.Output") {
    current_op->has_forward = true;
    CHECK(current_op->input_operands_seq.size() == 1);
    current_op->output_operands = current_op->input_operands_seq.front();
  } else {
    InferStatus status = current_op->layer->Forward();
    CHECK(status == InferStatus::kInferSuccess)
        << current_op->layer->layer_name()
        << " layer forward failed, error code: " << int(status);
    current_op->has_forward = true;
    ProbeNextLayer(current_op, current_op->output_operands->datas);
  }
}

The graph's input and output nodes get their own branches. The core of the remaining branch is connecting one operator's output to the next operator's input, which is exactly what ProbeNextLayer does. It takes the current operator and the current operator's output data as parameters:

void RuntimeGraph::ProbeNextLayer(
    const std::shared_ptr<RuntimeOperator> &current_op,
    const std::vector<std::shared_ptr<Tensor<float>>> &layer_output_datas) {
  // Successor nodes of the current node
  const auto &next_ops = current_op->output_operators;
  // Iterate over all successor nodes
  for (const auto &[_, next_rt_operator] : next_ops) {
    // Input operands of the successor node
    const auto &next_input_operands = next_rt_operator->input_operands;
    // Make sure the successor really takes an input from current_op
    if (next_input_operands.find(current_op->name) !=
        next_input_operands.end()) {
      // Get the successor's input slot that corresponds to current_op's output
      /**
       * next_input_operands:
       * {
       *    input 1 -- current_op.name: the output space of current_op
       *    input 2 -- other_op.name:   the output space of other_op
       * }
       */
      std::vector<std::shared_ptr<ftensor>> &next_input_datas =
          next_input_operands.at(current_op->name)->datas;
      CHECK(next_input_datas.size() == layer_output_datas.size());
      // Assign current_op's outputs into next_input_datas
      for (int i = 0; i < next_input_datas.size(); ++i) {
        next_input_datas.at(i) = layer_output_datas.at(i);
      }
    }
  }
}

Inside ProbeNextLayer, we iterate over all successors of the current node and assign the current node's output to each successor's input. Note that a successor may have several predecessors, so its input, const auto &next_input_operands = next_rt_operator->input_operands;, is a map whose key is a predecessor's name and whose value is that predecessor's output operand (a RuntimeOperand):

std::map<std::string, std::shared_ptr<RuntimeOperand>> input_operands;  /// Input operands of the node
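
To see what this wiring amounts to, here is a minimal, self-contained sketch with simplified stand-in types (not the actual KuiperInfer classes): the successor keeps its inputs in a map keyed by the predecessor's name, and "passing the output along" is just a shared_ptr assignment, so both operators end up referring to the same tensor objects rather than copies.

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for Tensor<float> and the operand map.
struct FakeTensor { std::vector<float> data; };
using TensorPtr = std::shared_ptr<FakeTensor>;

int main() {
  // Output tensors produced by a predecessor called "conv1".
  std::vector<TensorPtr> conv1_outputs{std::make_shared<FakeTensor>()};

  // The successor's input operands, keyed by predecessor name.
  std::map<std::string, std::vector<TensorPtr>> next_input_operands;
  next_input_operands["conv1"].resize(conv1_outputs.size());

  // What ProbeNextLayer does for every matching successor: shared_ptr assignment.
  for (size_t i = 0; i < conv1_outputs.size(); ++i) {
    next_input_operands["conv1"][i] = conv1_outputs[i];
  }

  std::cout << std::boolalpha
            << (next_input_operands["conv1"][0] == conv1_outputs[0])  // true: same tensor object
            << std::endl;
  return 0;
}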

After all operators have executed, the output of the whole graph is retrieved as follows:

if (operators_maps_.find(output_name_) != operators_maps_.end()) {
  const auto& output_op = operators_maps_.at(output_name_);
  CHECK(output_op->output_operands != nullptr)
      << "Output from" << output_op->name << " is empty";
  const auto& output_operand = output_op->output_operands;
  return output_operand->datas;
} else {
  LOG(FATAL) << "Can not find the output operator " << output_name_;
  return std::vector<std::shared_ptr<Tensor<float>>>{};
}

Since output_name is fixed during the graph's build stage, we can simply look the output node up in the operator map, take its output operand, and return that operand's data.

Operators Required by the ResNet Model


ResNet needs the following operators: convolution, max pooling, the ReLU activation, adaptive average pooling, and the fully connected (Linear) layer.

Writing and Registering the Linear Operator

Initializing the Linear Operator

LinearLayer derives from ParamLayer (which in turn derives from Layer) and adds the members its parent class does not have:

class LinearLayer : public ParamLayer {
 public:
  //  explicit LinearLayer(uint32_t batch, uint32_t in_channel, uint32_t in_dim, uint32_t out_dim, bool use_bias = true);
  explicit LinearLayer(int32_t in_features, int32_t out_features, bool use_bias);

  InferStatus Forward(const std::vector<std::shared_ptr<Tensor<float>>> &inputs,
                      std::vector<std::shared_ptr<Tensor<float>>> &outputs) override;

  static ParseParameterAttrStatus GetInstance(
      const std::shared_ptr<RuntimeOperator> &op,
      std::shared_ptr<Layer> &linear_layer);

 private:
  int32_t in_features_ = 0;
  int32_t out_features_ = 0;
  bool use_bias_ = false;
};

Next, the initialization function of the Linear layer:

ParseParameterAttrStatus LinearLayer::GetInstance(
    const std::shared_ptr<RuntimeOperator>& op,
    std::shared_ptr<Layer>& linear_layer) {
  CHECK(op != nullptr) << "Linear operator is nullptr";
  const auto& params = op->params;
  if (params.find("bias") == params.end()) {
    LOG(ERROR) << "Can not find the use bias parameter";
    return ParseParameterAttrStatus::kParameterMissingUseBias;
  }
  auto use_bias_param =
      std::dynamic_pointer_cast<RuntimeParameterBool>(params.at("bias"));
  if (use_bias_param == nullptr) {
    LOG(ERROR) << "Can not find the use bias parameter";
    return ParseParameterAttrStatus::kParameterMissingUseBias;
  }

  const auto& attr = op->attribute;
  CHECK(!attr.empty()) << "Operator attributes is empty";
  if (attr.find("weight") == attr.end()) {
    LOG(ERROR) << "Can not find the weight parameter";
    return ParseParameterAttrStatus::kAttrMissingWeight;
  }
  if (use_bias_param->value) {
    if (attr.find("bias") == attr.end()) {
      LOG(ERROR) << "Can not find the bias parameter";
      return ParseParameterAttrStatus::kAttrMissingBias;
    }
  }

  const auto& weight = attr.at("weight");
  const auto& bias = attr.at("bias");
  const auto& shapes = weight->shape;
  if ((shapes.size() < 2)) {
    LOG(ERROR) << "The graph only support two dimension matrix multiply";
    return ParseParameterAttrStatus::kAttrMissingOutFeatures;
  }

  int32_t out_features = shapes.at(0);
  int32_t in_features = shapes.at(1);
  const bool use_bias = use_bias_param->value;

  linear_layer =
      std::make_shared<LinearLayer>(in_features, out_features, use_bias);
  if (use_bias) {
    linear_layer->set_bias(bias->get<float>());
  }

  // load weights
  linear_layer->set_weights(weight->get<float>());
  return ParseParameterAttrStatus::kParameterAttrParseSuccess;
}

Executing the Linear Layer

The Linear layer computes $y = xW^T + b$; note that the weight matrix $W$ is transposed.

Here is the Forward overload that does the actual work; stepping through it in a debugger makes the flow clear. The heavy lifting is done with the Armadillo (arma) library.

InferStatus LinearLayer::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
    std::vector<std::shared_ptr<Tensor<float>>>& outputs) {
  if (inputs.empty()) {
    LOG(ERROR) << "The input tensor array in the linear layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }
  if (inputs.size() != outputs.size()) {
    LOG(ERROR) << "The input and output tensor array size of linear layer do "
                  "not match";
    return InferStatus::kInferFailedInputOutSizeMatchError;
  }

  if (this->weights_.empty()) {
    LOG(ERROR) << "The weight tensor in the linear layer is empty";
    return InferStatus::kInferFailedWeightParameterError;
  } else {
    if (this->use_bias_ && this->weights_.size() != this->bias_.size()) {
      LOG(ERROR) << "The size of the weight and bias tensor do not match";
      return InferStatus::kInferFailedBiasParameterError;
    }
  }

  if (weights_.size() != 1) {
    LOG(ERROR) << "Need one weight tensor in the linear layer";
    return InferStatus::kInferFailedWeightParameterError;
  }
  if (use_bias_ && this->bias_.size() != 1) {
    LOG(ERROR) << "Need one bias tensor in the linear layer";
    return InferStatus::kInferFailedBiasParameterError;
  }

  uint32_t batch = inputs.size();
  const std::shared_ptr<Tensor<float>>& weight = weights_.front();
  arma::fmat weight_data(weight->raw_ptr(), out_features_, in_features_, false,
                         true);
  const arma::fmat& weight_data_t = weight_data.t();

  for (uint32_t i = 0; i < batch; ++i) {
    const std::shared_ptr<Tensor<float>>& input = inputs.at(i);
    CHECK(input != nullptr && !input->empty())
        << "The input tensor array in the linear layer has an empty tensor "
        << i << " th";
    const std::vector<uint32_t>& input_shapes = input->shapes();

    const uint32_t feature_dims = input_shapes.at(1);
    const uint32_t in_features = input_shapes.at(2);
    CHECK(weight_data.n_rows == out_features_)
        << "The row of weight tensor should be same to output_features_";
    CHECK(weight_data.n_cols == in_features && in_features == in_features_)
        << "The col of weight tensor should be same to input_features_";

    arma::fmat input_vec((float*)input->raw_ptr(), feature_dims, in_features_,
                         false, true);
    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    if (output == nullptr || output->empty()) {
      output = std::make_shared<Tensor<float>>(1, feature_dims, out_features_);
      outputs.at(i) = output;
    }
    CHECK(output->channels() == 1 && output->rows() == feature_dims &&
          output->cols() == out_features_)
        << "The row of output tensor should be same to feature_dims_ and the "
           "col of output tensor should be same to output_features_ "
        << i << " th";
    const auto& output_raw_shapes = output->raw_shapes();
    if (output_raw_shapes.size() == 2) {
      CHECK(output_raw_shapes.at(0) == feature_dims &&
            output_raw_shapes.at(1) == out_features_);
    }
    if (output_raw_shapes.size() == 1) {
      CHECK(output_raw_shapes.at(0) == out_features_);
    }

    arma::fmat& result = output->slice(0);
    result = input_vec * weight_data_t;
    if (use_bias_) {
      CHECK(!this->bias_.empty() && this->bias_.size() == 1)
          << "The bias tensor is empty, but use_bias is true";
      const auto& bias_data = bias_.front()->data();
      CHECK(!bias_data.empty() && bias_data.n_slices == 1 &&
            bias_data.n_cols == out_features_)
          << "The col of bias tensor is not same to output_features_";
      const auto& bias_tensor = bias_data.slice(0);
      for (uint32_t row = 0; row < result.n_rows; ++row) {
        result.row(row) += bias_tensor;
      }
    }
  }
  return InferStatus::kInferSuccess;
}
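
To make the shape bookkeeping concrete, here is a minimal standalone Armadillo sketch of the same computation. The sizes (in_features = 512, out_features = 1000) are the ResNet18-style fully connected layer assumed for illustration; this is not KuiperInfer code.

#include <armadillo>
#include <iostream>

int main() {
  const arma::uword in_features = 512, out_features = 1000;

  arma::fmat x(1, in_features, arma::fill::randu);             // input row vector, shape (1, 512)
  arma::fmat W(out_features, in_features, arma::fill::randu);  // weight stored as (out, in), as in the pnnx file
  arma::frowvec b(out_features, arma::fill::randu);            // bias, shape (1, 1000)

  // y = x * W^T + b : (1, 512) x (512, 1000) -> (1, 1000)
  arma::fmat y = x * W.t();
  y.each_row() += b;  // broadcast the bias over every row

  std::cout << "y shape: " << y.n_rows << " x " << y.n_cols << std::endl;
  return 0;
}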

Inference with the ResNet Classification Network

Overview of the Overall Workflow

Classifying an input image with PyTorch in Python roughly involves the following steps:

  1. Load the pretrained model
  2. Read the input image and preprocess it
  3. Run a forward pass of the model on the image to get the prediction
  4. Post-process the prediction to obtain the image's class
  5. Output the classification result

Loading the ResNet Network in KuiperInfer, Compared with TensorRT

First, the computation graph is built from the ResNet structure file and weight file. Here is the KuiperInfer code:

const std::string& param_path = "course8/model_file/resnet18_batch1.param";
const std::string& weight_path = "course8/model_file/resnet18_batch1.pnnx.bin";
RuntimeGraph graph(param_path, weight_path);
graph.Build("pnnx_input_0", "pnnx_output_0");

For comparison, here is the corresponding TensorRT code:

import torch
import torchvision.models as models
import tensorrt as trt

# 1. Export the PyTorch model to ONNX
model = models.resnet50(pretrained=True)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet50.onnx")

# 2. Build the computation graph with TensorRT
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
with open("resnet50.onnx", 'rb') as f:
    onnx_parser.parse(f.read())
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 30  # 1GB
engine = builder.build_cuda_engine(network)

# 3. Save the TensorRT engine
with open("resnet50.trt", "wb") as f:
    f.write(engine.serialize())

The TensorRT engine is the optimized computation graph; its counterpart in KuiperInfer is the graph object.

Data Preprocessing

kuiper_infer::sftensor PreProcessImage(const cv::Mat &image) {
  using namespace kuiper_infer;
  assert(!image.empty());
  // Resize the input
  cv::Mat resize_image;
  cv::resize(image, resize_image, cv::Size(224, 224));

  cv::Mat rgb_image;
  cv::cvtColor(resize_image, rgb_image, cv::COLOR_BGR2RGB);

  rgb_image.convertTo(rgb_image, CV_32FC3);
  std::vector<cv::Mat> split_images;
  cv::split(rgb_image, split_images);

  uint32_t input_w = 224;
  uint32_t input_h = 224;
  uint32_t input_c = 3;
  sftensor input = std::make_shared<ftensor>(input_c, input_h, input_w);

  uint32_t index = 0;
  for (const auto &split_image : split_images) {
    assert(split_image.total() == input_w * input_h);
    const cv::Mat &split_image_t = split_image.t();
    memcpy(input->slice(index).memptr(), split_image_t.data,
           sizeof(float) * split_image.total());
    index += 1;
  }

  float mean_r = 0.485f;
  float mean_g = 0.456f;
  float mean_b = 0.406f;

  float var_r = 0.229f;
  float var_g = 0.224f;
  float var_b = 0.225f;

  assert(input->channels() == 3);
  input->data() = input->data() / 255.f;
  input->slice(0) = (input->slice(0) - mean_r) / var_r;
  input->slice(1) = (input->slice(1) - mean_g) / var_g;
  input->slice(2) = (input->slice(2) - mean_b) / var_b;
  return input;
}

The image-processing details are not our focus here. The point to notice is that the preprocessing copies the data into the input tensor channel by channel:

memcpy(input->slice(index).memptr(), split_image_t.data, sizeof(float) * split_image.total());

Because split_image is a cv::Mat, which stores data row-major, while input's underlying storage is column-major, split_image is transposed first: the row-major bytes of the transposed matrix are exactly the column-major layout of the original channel, so the whole channel can be copied with one contiguous memcpy instead of element-by-element reads and writes.
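
The following standalone sketch illustrates why the transpose is needed. It uses plain float buffers instead of cv::Mat, purely for illustration: the row-major buffer of the transposed matrix is exactly what a column-major arma::fmat expects.

#include <armadillo>
#include <cstring>

int main() {
  // A 2x3 "image" stored row-major, as cv::Mat would store it:
  // row 0: 1 2 3, row 1: 4 5 6
  float row_major[6] = {1, 2, 3, 4, 5, 6};

  // Copying the row-major bytes straight into a column-major fmat scrambles it.
  arma::fmat wrong(2, 3);
  std::memcpy(wrong.memptr(), row_major, sizeof(row_major));

  // The row-major layout of the 3x2 transpose equals the column-major layout
  // of the original 2x3 matrix (cv::Mat::t() produces exactly this buffer).
  float transposed[6] = {1, 4, 2, 5, 3, 6};
  arma::fmat right(2, 3);
  std::memcpy(right.memptr(), transposed, sizeof(transposed));

  wrong.print("direct copy (scrambled):");
  right.print("copy of transposed buffer (correct):");
  return 0;
}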

Running Inference

auto outputs = graph.Forward(inputs, true);

After preprocessing, graph.Forward runs inference on the input image and places the results in the outputs tensor array. The output tensors have not been passed through softmax yet, so a post-processing step follows.

Post-processing

SoftmaxLayer softmax_layer(0);
std::vector<sftensor> outputs_softmax(batch_size);
softmax_layer.Forward(outputs, outputs_softmax);

The output has shape $(1, 1000)$. We first apply softmax to the output tensor produced in the inference step above. Softmax is defined as

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

where the $x_j$ are the elements of the outputs tensor and $K$ is the number of elements, here 1000.

for (int i = 0; i < outputs_softmax.size(); ++i) {
  const sftensor& output_tensor = outputs_softmax.at(i);
  assert(output_tensor->size() == 1 * 1000);
  // Find the class with the highest probability
  float max_prob = -1;
  int max_index = -1;
  for (int j = 0; j < output_tensor->size(); ++j) {
    float prob = output_tensor->index(j);
    if (max_prob <= prob) {
      max_prob = prob;
      max_index = j;
    }
  }
  printf("class with max prob is %f index %d\n", max_prob, max_index);
}

Finally, we scan the softmax output for its maximum and take the index with the highest probability as the classifier's prediction.

Analyzing the Result

class with max prob is 0.663738 index 817

Looking up index 817 in the ImageNet class table (the 818th entry, since indices start at 0), we find the sports car class, which matches the input image. With that, ResNet inference in KuiperInfer is complete.

Homework

  1. Analyze on your own the implementation and computation of the other operators used in ResNet, mainly the following three:
    • the Flatten layer, see source/layer/details/flatten.cpp
    • the adaptive average pooling layer, see source/layer/details/adaptive_avgpooling.cpp
    • the Softmax layer, used in post-processing, see source/layer/details/softmax.cpp
  2. Walk carefully through every step of the ResNet inference implemented in this lesson, including loading the model, loading the image, preprocessing the input, and the graph.Forward method that schedules and executes all operators of the model.

Answers:

Implementation of the Flatten Layer

The Flatten layer connects the convolutional part of the network to the fully connected part. Flattening turns multi-dimensional data into one dimension; for example, a 3x28x28 tensor becomes 1x2352 (3 x 28 x 28 = 2352).

Here is the definition of FlattenLayer:

class FlattenLayer : public NonParamLayer {
 public:
  explicit FlattenLayer(int start_dim, int end_dim);

  InferStatus Forward(const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
                      std::vector<std::shared_ptr<Tensor<float>>>& outputs) override;

  static ParseParameterAttrStatus CreateInstance(
      const std::shared_ptr<RuntimeOperator>& op,
      std::shared_ptr<Layer>& flatten_layer);

 private:
  int start_dim_ = 0;
  int end_dim_ = 0;
};

start_dim and end_dim specify the range of dimensions to flatten; the batch dimension is normally left untouched.

Now look at the Forward overload of the Flatten layer. The idea is to first validate start_dim and end_dim, then compute the number of elements covered by the flattened range, determine the shape of the flattened feature map, and finally call Reshape to produce the output shape.

InferStatus FlattenLayer::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
    std::vector<std::shared_ptr<Tensor<float>>>& outputs) {
  if (inputs.empty()) {
    LOG(ERROR) << "The input tensor array in the flatten layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }
  if (inputs.size() != outputs.size()) {
    LOG(ERROR) << "The input and output tensor array size of the flatten "
                  "layer do not match";
    return InferStatus::kInferFailedInputOutSizeMatchError;
  }

  int start_dim = start_dim_;
  int end_dim = end_dim_;
  int total_dims = 4;  // NCHW

  if (start_dim < 0) {
    start_dim = total_dims + start_dim;
  }
  if (end_dim < 0) {
    end_dim = total_dims + end_dim;
  }

  CHECK(end_dim > start_dim) << "The end dim must greater than start dim";
  CHECK(end_dim <= 3 && start_dim >= 1)
      << "The end dim must less than two and start dim must greater than zero";

  const uint32_t batch_size = inputs.size();
  for (uint32_t i = 0; i < batch_size; ++i) {
    const std::shared_ptr<Tensor<float>>& input = inputs.at(i);
    if (input == nullptr || input->empty()) {
      LOG(ERROR) << "The input tensor array in the flatten layer has"
                    " an empty tensor "
                 << i << " th";
      return InferStatus::kInferFailedInputEmpty;
    }

    auto shapes = input->shapes();
    shapes.insert(shapes.begin(), batch_size);
    uint32_t elements_size =
        std::accumulate(shapes.begin() + start_dim,
                        shapes.begin() + end_dim + 1, 1, std::multiplies());

    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    output = TensorClone(input);
    CHECK(input->size() == output->size())
        << "The output and input shapes of the flatten layer do "
           "not match "
        << i << " th";
    outputs.at(i) = output;

    if (start_dim == 1 && end_dim == 3) {
      // Keep the batch dimension, flatten everything else
      output->Reshape({elements_size}, true);
    } else if (start_dim == 2 && end_dim == 3) {
      // Flatten only the height and width dimensions
      uint32_t channels = input->channels();
      output->Reshape({channels, elements_size}, true);
    } else if (start_dim == 1 && end_dim == 2) {
      // Flatten the channel and height dimensions
      uint32_t cols = input->cols();
      output->Reshape({elements_size, cols}, true);
    } else {
      LOG(FATAL) << "Wrong flatten dim: "
                 << "start dim: " << start_dim << " end dim: " << end_dim;
    }
  }
  return InferStatus::kInferSuccess;
}

The core part is:

if (start_dim == 1 && end_dim == 3) {
  // Keep the batch dimension, flatten everything else
  output->Reshape({elements_size}, true);
} else if (start_dim == 2 && end_dim == 3) {
  // Flatten only the height and width dimensions
  uint32_t channels = input->channels();
  output->Reshape({channels, elements_size}, true);
} else if (start_dim == 1 && end_dim == 2) {
  // Flatten the channel and height dimensions
  uint32_t cols = input->cols();
  output->Reshape({elements_size, cols}, true);
} else {
  LOG(FATAL) << "Wrong flatten dim: "
             << "start dim: " << start_dim << " end dim: " << end_dim;
}

Here elements_size is the number of elements covered by the flattened dimensions. For example, for data of shape [2, 3, 28, 28] with start_dim = 2 and end_dim = 3, elements_size = 28 x 28 = 784. The [start_dim, end_dim] range is inclusive at both ends, as the snippet below checks.
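
A quick standalone check of this computation, mirroring the std::accumulate call in FlattenLayer::Forward (the shape here is only an example):

#include <cstdint>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // Shape with the batch dimension prepended, as in FlattenLayer::Forward.
  std::vector<uint32_t> shapes{2, 3, 28, 28};
  int start_dim = 2, end_dim = 3;  // flatten only H and W

  // Inclusive range [start_dim, end_dim], hence the "+ 1" on the end iterator.
  uint32_t elements_size =
      std::accumulate(shapes.begin() + start_dim, shapes.begin() + end_dim + 1,
                      1u, std::multiplies<uint32_t>());
  std::cout << elements_size << std::endl;  // prints 784 (= 28 * 28)
  return 0;
}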

Since the Tensor class already provides a Reshape function, we simply call it; note that by default the flattening is performed in row-major order.

Adaptive Average Pooling Layer

An adaptive pooling layer only asks the user for the size of the output feature map; kernel_size, stride, and padding do not have to be specified, because the layer derives them from the requested output size. Recall the relationship between a pooling layer's input and output sizes:

$$out\_size = \frac{in\_size - kernel\_size + 2 \times padding}{stride} + 1$$

This single equation contains five variables. With out_size and in_size fixed, three unknowns remain, and one equation cannot determine three unknowns, so two of them are fixed by convention: $stride = \lfloor in\_size / out\_size \rfloor$ and padding = 0 (no padding). kernel_size then follows as $kernel\_size = in\_size - (out\_size - 1) \times stride$.
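
A minimal sketch of this derivation (the 7x7 to 1x1 case is the one ResNet18's adaptive average pooling faces before the fully connected layer; the code below is illustrative, not the KuiperInfer API):

#include <cstdint>
#include <iostream>

int main() {
  const uint32_t in_h = 7, out_h = 1;  // e.g. a 7x7 feature map pooled to 1x1

  const uint32_t stride = in_h / out_h;                   // floor division -> 7
  const uint32_t kernel = in_h - (out_h - 1) * stride;    // -> 7

  // Check with the standard pooling formula: (in - kernel + 2*0) / stride + 1
  const uint32_t recovered_out = (in_h - kernel) / stride + 1;  // -> 1

  std::cout << "stride=" << stride << " kernel=" << kernel
            << " out=" << recovered_out << std::endl;
  return 0;
}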

The implementation first derives kernel_size and stride from these formulas; after that, the procedure is the same as ordinary pooling, with no padding to worry about. The points that need care are how to read all values of the input window and how to slide the window: because the underlying data is column-major, the code always walks columns first and rows second, both when reading the input and when writing the output.

InferStatus AdaptiveAveragePoolingLayer::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
    std::vector<std::shared_ptr<Tensor<float>>>& outputs) {
  if (inputs.empty()) {
    LOG(ERROR)
        << "The input tensor array in the adaptive pooling layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }
  if (inputs.size() != outputs.size()) {
    LOG(ERROR) << "The input and output tensor array size of the adaptive "
                  "pooling layer "
                  "do not match";
    return InferStatus::kInferFailedInputOutSizeMatchError;
  }

  const uint32_t batch = inputs.size();
  for (uint32_t i = 0; i < batch; ++i) {
    const std::shared_ptr<ftensor>& input_data = inputs.at(i);
    const std::shared_ptr<ftensor>& output_data = outputs.at(i);
    if (input_data == nullptr || input_data->empty()) {
      LOG(ERROR) << "The input tensor array in the adaptive pooling layer has "
                    "an empty tensor "
                 << i << "th";
      return InferStatus::kInferFailedInputEmpty;
    }
    if (output_data != nullptr && !output_data->empty()) {
      if (output_data->rows() != output_h_ ||
          output_data->cols() != output_w_) {
        LOG(ERROR) << "The output tensor array in the adaptive pooling layer "
                      "has an incorrectly sized tensor "
                   << i << "th";
        return InferStatus::kInferFailedOutputSizeError;
      }
    }
  }

  for (uint32_t i = 0; i < batch; ++i) {
    const std::shared_ptr<Tensor<float>>& input_data = inputs.at(i);
    CHECK(input_data != nullptr && !input_data->empty())
        << "The input tensor array in the adaptive pooling layer has an empty "
           "tensor "
        << i << "th";

    const uint32_t input_h = input_data->rows();
    const uint32_t input_w = input_data->cols();
    const uint32_t input_c = input_data->channels();
    const uint32_t stride_h = uint32_t(std::floor(input_h / output_h_));
    const uint32_t stride_w = uint32_t(std::floor(input_w / output_w_));
    CHECK(stride_w > 0 && stride_h > 0)
        << "The stride parameter is set incorrectly. It must always be greater "
           "than 0";

    const uint32_t pooling_h =
        (int)input_h - (int(output_h_) - 1) * int(stride_h);
    const uint32_t pooling_w =
        (int)input_w - (int(output_w_) - 1) * int(stride_w);
    CHECK(pooling_w > 0 && pooling_h > 0)
        << "The pooling parameter is set incorrectly. It must always be "
           "greater than 0";

    std::shared_ptr<Tensor<float>> output_data = outputs.at(i);
    if (output_data == nullptr || output_data->empty()) {
      DLOG(ERROR) << "The output tensor array in the adaptive pooling layer "
                     "has an empty tensor "
                  << i << "th";
      output_data =
          std::make_shared<Tensor<float>>(input_c, output_h_, output_w_);
      outputs.at(i) = output_data;
    }
    CHECK(output_data->rows() == output_h_ &&
          output_data->cols() == output_w_ &&
          output_data->channels() == input_c)
        << "The output tensor array in the adaptive pooling layer has an "
           "incorrectly sized tensor "
        << i << "th";

    const uint32_t pooling_size = pooling_h * pooling_w;
    for (uint32_t ic = 0; ic < input_c; ++ic) {
      const arma::fmat& input_channel = input_data->slice(ic);
      arma::fmat& output_channel = output_data->slice(ic);
      for (uint32_t c = 0; c < input_w - pooling_w + 1; c += stride_w) {
        int output_col = int(c / stride_w);
        for (uint32_t r = 0; r < input_h - pooling_h + 1; r += stride_h) {
          int output_row = int(r / stride_h);
          float mean_value = 0.f;
          float* output_channel_ptr = output_channel.colptr(output_col);
          for (uint32_t w = 0; w < pooling_w; ++w) {
            const float* col_ptr = input_channel.colptr(c + w) + r;
            for (uint32_t h = 0; h < pooling_h; ++h) {
              float current_value = *(col_ptr + h);
              mean_value = mean_value + current_value;
            }
          }
          *(output_channel_ptr + output_row) = mean_value / float(pooling_size);
        }
      }
    }
  }
  return InferStatus::kInferSuccess;
}

Softmax Layer

First, the definition of SoftmaxLayer:

class SoftmaxLayer : public NonParamLayer {
 public:
  explicit SoftmaxLayer(int dim = -1);

  InferStatus Forward(const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
                      std::vector<std::shared_ptr<Tensor<float>>>& outputs) override;

  static ParseParameterAttrStatus CreateInstance(
      const std::shared_ptr<RuntimeOperator>& op,
      std::shared_ptr<Layer>& softmax_layer);

 private:
  int softmax_dim_ = -1;
};

The member softmax_dim_ specifies the dimension along which softmax is computed. For an input of shape [b, c, h, w], dim does not include the batch dimension and only refers to [c, h, w]; dim = 0 therefore means computing softmax along the channel dimension. (In PyTorch, Softmax's dim starts from the batch dimension, so dim = 0 there means softmax along the batch dimension.) Samples in a batch do not affect each other, so we can set the batch dimension aside. Given a three-dimensional tensor [c, h, w] and softmax along the channel dimension, we take the one-dimensional slice along the channel direction at each (h, w) position, compute softmax over that slice, and repeat for every (h, w) position. Other choices of dim work the same way.

The Softmax layer computes $y_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$, applied to each of the one-dimensional slices described above, where $K$ is the number of elements in the slice. In practice, the maximum of the slice is first subtracted from every value before exponentiation, which avoids overflow.
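
Subtracting the maximum is safe because the common factor $e^{-m}$ cancels:

$$\frac{e^{x_i - m}}{\sum_{j=1}^{K} e^{x_j - m}} = \frac{e^{-m}\, e^{x_i}}{e^{-m} \sum_{j=1}^{K} e^{x_j}} = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad m = \max_{j} x_j$$

After the shift, the largest exponentiated value is $e^{0} = 1$, so neither the numerator nor the denominator can overflow.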

Here is an example. The input has two channels:

# Channel 0
[[1, 2, 3],
 [4, 5, 6]],
# Channel 1
[[7, 8, 9],
 [10, 11, 12]]

Computing softmax along the channel dimension gives:

[[[0.0025, 0.0025, 0.0025],
  [0.0025, 0.0025, 0.0025]],
 [[0.9975, 0.9975, 0.9975],
  [0.9975, 0.9975, 0.9975]]]
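
A quick standalone check of these numbers (for illustration only): at every spatial position the two channel values differ by 6, so every position yields the same pair of probabilities.

#include <cmath>
#include <cstdio>

int main() {
  const float a = 1.f, b = 7.f;  // e.g. position (0, 0) of channel 0 and channel 1
  const float denom = std::exp(a) + std::exp(b);
  std::printf("%.4f %.4f\n", std::exp(a) / denom, std::exp(b) / denom);  // ~0.0025 and ~0.9975
  return 0;
}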

In fact, no matter how many dimensions the input has, we only need to fix the coordinates of every dimension except the chosen one, take the one-dimensional slice along the chosen dimension, compute softmax over that slice, and then move the fixed coordinates and repeat.

Now the Forward function:

InferStatus SoftmaxLayer::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
    std::vector<std::shared_ptr<Tensor<float>>>& outputs) {
  if (inputs.empty()) {
    LOG(ERROR) << "The input tensor array in the softmax layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }
  if (inputs.size() != outputs.size()) {
    LOG(ERROR) << "The input and output tensor array size of the softmax layer "
                  "do not match";
    return InferStatus::kInferFailedInputOutSizeMatchError;
  }

  const uint32_t batch_size = inputs.size();
  for (uint32_t i = 0; i < batch_size; ++i) {
    const std::shared_ptr<Tensor<float>>& input = inputs.at(i);
    CHECK(input != nullptr && !input->empty())
        << "The input tensor array in the softmax layer has an empty tensor "
        << i << " th";

    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    if (output == nullptr || output->empty()) {
      output = std::make_shared<Tensor<float>>(input->shapes());
      outputs.at(i) = output;
    }
    CHECK(input->shapes() == output->shapes())
        << "The input and output tensor shapes of the softmax layer do not "
           "match "
        << i << " th";

    int dim = this->softmax_dim_;
    std::vector<uint32_t> raw_shapes = input->raw_shapes();

    if (dim < 0) {
      dim += int(raw_shapes.size());
    }
    if (dim < 0 || dim >= 3 || dim > raw_shapes.size()) {
      LOG(FATAL) << "Error softmax dimension, which need between 0 and 2, "
                    "but dimension is "
                 << dim;
    }
    const uint32_t padding_size_num = 3 - raw_shapes.size();
    for (uint32_t j = 0; j < padding_size_num; ++j) {
      raw_shapes.push_back(1);
    }

    /**
     * [...(outer) dim ...(inner)]
     * Split the shape around the dim axis into an outer part and an inner part:
     * outer_sizes is the product of the dimensions before the dim axis,
     * inner_sizes is the product of the dimensions after the dim axis.
     */
    const uint32_t inner_sizes = std::accumulate(
        raw_shapes.begin() + dim + 1, raw_shapes.end(), 1, std::multiplies());
    const uint32_t outer_sizes = std::accumulate(
        raw_shapes.begin(), raw_shapes.begin() + dim, 1, std::multiplies());

    // Number of elements along the dim axis
    const uint32_t axis_sizes = raw_shapes.at(dim);
    CHECK_EQ(axis_sizes * outer_sizes * inner_sizes, input->size());

    const auto& input_values = input->values(true);
    std::vector<float> output_values(input_values.size());
    for (uint32_t outer_size = 0; outer_size < outer_sizes; ++outer_size) {
      for (uint32_t inner_size = 0; inner_size < inner_sizes; ++inner_size) {
        float max_value = std::numeric_limits<float>::lowest();
        // Iterate over the data along the dim axis and find the maximum value
        for (uint32_t axis_size = 0; axis_size < axis_sizes; ++axis_size) {
          uint32_t index = POS_INDEX(outer_size, inner_size, axis_size);
          float cur_value = input_values.at(index);
          if (cur_value > max_value) {
            max_value = cur_value;
          }
        }

        float sum_value = 0.f;
        // Iterate over the data along the dim axis and accumulate the sum
        for (uint32_t axis_size = 0; axis_size < axis_sizes; ++axis_size) {
          uint32_t index = POS_INDEX(outer_size, inner_size, axis_size);
          float cur_value = input_values.at(index);
          float exp_sub_value = fmath::exp(cur_value - max_value);
          sum_value += exp_sub_value;
          output_values.at(index) = exp_sub_value;
        }

        // Iterate again and compute exp(cur_value - max_value) / sum_value
        for (uint32_t axis_size = 0; axis_size < axis_sizes; ++axis_size) {
          uint32_t index = POS_INDEX(outer_size, inner_size, axis_size);
          float exp_sub_value = output_values.at(index);
          output_values.at(index) = exp_sub_value / sum_value;
        }
      }
    }
    output->Fill(output_values, true);
  }
  return InferStatus::kInferSuccess;
}

Note that the implementation does not write each slice's softmax into the output as soon as it is computed. Instead, the results for all positions are collected in output_values and then written into the output tensor in one pass, in row-major order. Whatever the tensor's shape, its memory is laid out linearly, so filling it once at the end keeps the write pattern simple and reduces the overhead.

This is another nice trick worth learning, so let us look at how the author locates an element in memory. The shape is split around the dim axis into an outer part and an inner part. For a tensor of shape [3, 28, 28] with dim = 1, the layout reads as [outer, dim, inner], with outer_sizes = 3, axis_sizes = 28 for the dim axis, and inner_sizes = 28. With dim = 0, outer_sizes = 1, axis_sizes = 3, and inner_sizes = 28 x 28 = 784. One- or two-dimensional tensors are first padded to three dimensions; for example, a shape of 1000 becomes [1000, 1, 1], and outer and inner are then split as usual.

Note how outer_sizes and inner_sizes are computed:

const uint32_t inner_sizes = std::accumulate(raw_shapes.begin() + dim + 1, raw_shapes.end(), 1, std::multiplies());
const uint32_t outer_sizes = std::accumulate(raw_shapes.begin(), raw_shapes.begin() + dim, 1, std::multiplies());

An element can then be addressed by the coordinate [outer_size, axis_size, inner_size]; since the data is row-major, the corresponding linear index is

pos_index = outer_size * axis_sizes * inner_sizes + axis_size * inner_sizes + inner_size
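
Putting the two formulas together, here is a small standalone check for a [3, 28, 28] tensor with dim = 1 (the pos_index helper below is a hypothetical stand-in for the POS_INDEX macro used in SoftmaxLayer::Forward):

#include <cstdint>
#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for POS_INDEX: row-major linear index of [outer, axis, inner].
static uint32_t pos_index(uint32_t outer, uint32_t axis, uint32_t inner,
                          uint32_t axis_sizes, uint32_t inner_sizes) {
  return outer * axis_sizes * inner_sizes + axis * inner_sizes + inner;
}

int main() {
  std::vector<uint32_t> raw_shapes{3, 28, 28};
  const int dim = 1;

  const uint32_t inner_sizes =
      std::accumulate(raw_shapes.begin() + dim + 1, raw_shapes.end(), 1u,
                      std::multiplies<uint32_t>());  // 28
  const uint32_t outer_sizes =
      std::accumulate(raw_shapes.begin(), raw_shapes.begin() + dim, 1u,
                      std::multiplies<uint32_t>());  // 3 (for dim = 0 the range is empty and accumulate returns the init value 1)
  const uint32_t axis_sizes = raw_shapes.at(dim);    // 28

  // Element [channel 2, row 5, col 7] of the row-major [3, 28, 28] buffer:
  std::printf("outer=%u axis=%u inner=%u index=%u\n", outer_sizes, axis_sizes,
              inner_sizes, pos_index(2, 5, 7, axis_sizes, inner_sizes));  // index = 2*784 + 5*28 + 7 = 1715
  return 0;
}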

Note: in all of the above, the samples within a batch are independent; the first and second samples of the same batch do not affect each other.

Stepping Through the Whole Inference Process Again in a Debugger

We pass in the model structure file (the .param file exported by pnnx) and the model weight file (.bin), and construct a RuntimeGraph object with RuntimeGraph graph(param_path, weight_path). The graph then goes through two stages, initialization (Init) and building (Build). RuntimeGraph holds a member graph_ of type pnnx::Graph (RuntimeGraph can be understood as a further wrapper around pnnx::Graph). Init first calls graph_'s load function, which reads the structure and weight files, builds graph_, and initializes its members, the computation nodes at the pnnx level such as pnnx::Operator and pnnx::Operand. Once graph_ is initialized, Init uses the information in it to build the RuntimeGraph: it sets each operator's name, records the names and shapes of its input operands, records the names of its output operators, and initializes the operator's attributes (weights) and parameters. In short, the initialization stage sets up each node's weights and parameters and records its successors and its input and output information.

After initialization comes the Build stage. First the graph topology is established: using the successor (output operator) names recorded during Init, each successor is looked up in operators_maps_ and the whole node is inserted into the current node's successor list. Once the topology is built, a computation layer (Layer) is created for every computation node except the graph's input and output nodes; the layer holds the actual computation, carried out in its Forward function, and node and layer correspond one to one. Output space is then allocated for every node in advance; since a node's output serves as the next node's input, no separate space needs to be allocated for inputs. Finally a topological ordering is built, because execution is still linear and the nodes' execution order has to be fixed.

Once building is done and the input has been preprocessed, RuntimeGraph's Forward function executes each computation node in topological order to obtain the graph's output, which is then post-processed to give the final inference result.
