Paddle-mobile has rewritten and optimized its matrix operation kernels sgemm and sgemv, yielding a performance improvement of 10%~100% on most models.
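For context, sgemv is the BLAS-style matrix-vector product y = alpha*A*x + beta*y. A minimal scalar reference of those semantics (the optimized kernels tile the loops and use NEON intrinsics; this sketch only fixes the math):

```cpp
#include <cstddef>

// Reference single-precision GEMV: y = alpha * A * x + beta * y,
// with A stored row-major as an m x n matrix. Real kernels block this
// loop nest and vectorize it; this shows only the semantics.
void sgemv_ref(std::size_t m, std::size_t n, float alpha,
               const float* A, const float* x, float beta, float* y) {
  for (std::size_t i = 0; i < m; ++i) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
      acc += A[i * n + j] * x[j];
    }
    y[i] = alpha * acc + beta * y[i];
  }
}
```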
This version adds 19 new operators, including while, sequence_expand, sequence_pool, sequence_softmax, gru_unit, beam_search, and beam_search_decode. Together with extensive optimization work, this enables prediction for attention-based end-to-end models.
Winograd implementation for ARMv8: higher inference performance on ARMv8 hardware on iOS; the Winograd path also supports operator fusion, so fused operators remain efficient.
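For intuition, the 1-D Winograd transform F(2,3) computes two outputs of a 3-tap filter with four multiplications instead of six; the F(2x2, 3x3) form used for 3x3 convolution nests the same transforms over rows and columns. A minimal sketch of the 1-D case (illustrative only, not the tiled NEON kernel):

```cpp
#include <array>

// Winograd F(2,3): two outputs of a 3-tap correlation over a 4-element
// input tile, using 4 multiplications instead of 6.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
  // Filter transform G*g and input transform B^T*d, fused per product.
  float m1 = (d[0] - d[2]) * g[0];
  float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
  float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
  float m4 = (d[1] - d[3]) * g[2];
  // Output transform A^T * m:
  // y[0] == d0*g0 + d1*g1 + d2*g2, y[1] == d1*g0 + d2*g1 + d3*g2.
  return {m1 + m2 + m3, m2 - m3 - m4};
}
```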
Direct convolution for 3x3 kernels, which is more efficient than Winograd and GEMM when the number of channels is small.
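Direct convolution skips the input/filter transforms entirely, which is why it wins when channel counts are small and there is little reduction work to amortize a transform over. A naive sketch (stride 1, no padding, NCHW layout assumed for illustration; real kernels vectorize and block):

```cpp
#include <cstddef>
#include <vector>

// Naive direct 3x3 convolution, stride 1, no padding, NCHW layout.
// in: [C_in, H, W], wgt: [C_out, C_in, 3, 3], out: [C_out, H-2, W-2].
void direct_conv3x3(const std::vector<float>& in, int c_in, int h, int w,
                    const std::vector<float>& wgt, int c_out,
                    std::vector<float>& out) {
  int oh = h - 2, ow = w - 2;
  out.assign(static_cast<std::size_t>(c_out) * oh * ow, 0.0f);
  for (int oc = 0; oc < c_out; ++oc)
    for (int ic = 0; ic < c_in; ++ic)         // accumulate over input channels
      for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
          float acc = 0.0f;
          for (int ky = 0; ky < 3; ++ky)      // 3x3 sliding window
            for (int kx = 0; kx < 3; ++kx)
              acc += in[(ic * h + y + ky) * w + x + kx] *
                     wgt[((oc * c_in + ic) * 3 + ky) * 3 + kx];
          out[(oc * oh + y) * ow + x] += acc;
        }
}
```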
Rewritten and optimized 3x3 depthwise convolution: unlike previous versions, it supports arbitrary padding, and delivers better performance and more reliable numerical results.
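A minimal sketch of depthwise convolution with the arbitrary-padding behavior described above (each channel has its own filter; taps outside the input read as zero; layout and signature are assumptions for illustration):

```cpp
#include <cstddef>
#include <vector>

// Naive depthwise convolution with arbitrary zero padding, stride 1,
// NCHW layout. Channel ch uses its own k x k filter.
void depthwise_conv(const std::vector<float>& in, int c, int h, int w,
                    const std::vector<float>& wgt, int k,
                    int pad_h, int pad_w, std::vector<float>& out) {
  int oh = h + 2 * pad_h - k + 1;
  int ow = w + 2 * pad_w - k + 1;
  out.assign(static_cast<std::size_t>(c) * oh * ow, 0.0f);
  for (int ch = 0; ch < c; ++ch)
    for (int y = 0; y < oh; ++y)
      for (int x = 0; x < ow; ++x) {
        float acc = 0.0f;
        for (int ky = 0; ky < k; ++ky)
          for (int kx = 0; kx < k; ++kx) {
            int iy = y + ky - pad_h, ix = x + kx - pad_w;
            if (iy >= 0 && iy < h && ix >= 0 && ix < w)  // zero padding
              acc += in[(ch * h + iy) * w + ix] *
                     wgt[(ch * k + ky) * k + kx];
          }
        out[(ch * oh + y) * ow + x] = acc;
      }
}
```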
5x5 depthwise convolution on ARMv8: NAS model prediction speeds up by more than 30%.
Completed the efficiency optimization of deconvolution (conv2d_transpose).
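For reference, conv2d_transpose can be read as convolution run as a scatter: each input element multiplies the whole kernel and accumulates into a strided window of the output. A naive single-channel sketch (an illustration of the operator's semantics, not the optimized kernel):

```cpp
#include <cstddef>
#include <vector>

// Naive single-channel conv2d_transpose: every input pixel broadcasts
// the k x k kernel into the output at the given stride, shifted by pad.
// Output size: (H-1)*stride + k - 2*pad per spatial dimension.
void conv2d_transpose(const std::vector<float>& in, int h, int w,
                      const std::vector<float>& wgt, int k,
                      int stride, int pad, std::vector<float>& out) {
  int oh = (h - 1) * stride + k - 2 * pad;
  int ow = (w - 1) * stride + k - 2 * pad;
  out.assign(static_cast<std::size_t>(oh) * ow, 0.0f);
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      for (int ky = 0; ky < k; ++ky)
        for (int kx = 0; kx < k; ++kx) {
          int oy = y * stride + ky - pad;
          int ox = x * stride + kx - pad;
          if (oy >= 0 && oy < oh && ox >= 0 && ox < ow)
            out[oy * ow + ox] += in[y * w + x] * wgt[ky * k + kx];
        }
}
```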
Integrated a memory-reuse strategy based on graph optimization. With the strategy applied, most models reduce memory usage by nearly 50%. It is enabled automatically for the ARM CPU (not compatible with FPGA and GPU).
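The idea behind a graph-based memory-reuse pass: derive each tensor's live range from the topologically sorted graph, then let tensors with non-overlapping ranges share one buffer. A greedy sketch of that sharing step (an illustration of the general technique, not the library's actual pass):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct LiveRange {
  int first_op, last_op;  // ops (topological order) where the tensor is live
  std::size_t bytes;      // tensor size in bytes
  int buffer = -1;        // assigned shared buffer
};

// Greedy reuse over tensors sorted by first_op: two tensors may share a
// buffer iff their live ranges do not overlap. Returns each buffer's size.
std::vector<std::size_t> assign_buffers(std::vector<LiveRange>& tensors) {
  std::vector<std::size_t> buf_bytes;  // size of each shared buffer
  std::vector<int> buf_free_from;      // last op after which buffer is free
  for (auto& t : tensors) {
    int chosen = -1;
    for (std::size_t b = 0; b < buf_bytes.size(); ++b)
      if (buf_free_from[b] < t.first_op) {  // previous occupant already dead
        chosen = static_cast<int>(b);
        break;
      }
    if (chosen < 0) {  // no reusable buffer: open a new one
      chosen = static_cast<int>(buf_bytes.size());
      buf_bytes.push_back(0);
      buf_free_from.push_back(-1);
    }
    t.buffer = chosen;
    buf_bytes[chosen] = std::max(buf_bytes[chosen], t.bytes);
    buf_free_from[chosen] = t.last_op;
  }
  return buf_bytes;
}
```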
ARM GPU
Paddle-mobile optimizes 1x1 convolution; MobileNet v1 achieves an average inference performance improvement of 35% on Qualcomm Adreno GPUs.
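The property such an optimization exploits is that a 1x1 convolution over NCHW data is exactly a matrix multiply, out[C_out, H*W] = wgt[C_out, C_in] x in[C_in, H*W], so it maps straight onto a tuned GEMM (an OpenCL kernel on the GPU). A CPU-side sketch of the equivalence:

```cpp
#include <cstddef>
#include <vector>

// 1x1 convolution over an NCHW tensor as a plain matrix multiply:
//   out[C_out, H*W] = wgt[C_out, C_in] * in[C_in, H*W].
void conv1x1_as_gemm(const std::vector<float>& in, int c_in, int hw,
                     const std::vector<float>& wgt, int c_out,
                     std::vector<float>& out) {
  out.assign(static_cast<std::size_t>(c_out) * hw, 0.0f);
  for (int oc = 0; oc < c_out; ++oc)
    for (int ic = 0; ic < c_in; ++ic) {
      float wv = wgt[oc * c_in + ic];     // one scalar weight per (oc, ic)
      for (int p = 0; p < hw; ++p)        // sweep all spatial positions
        out[oc * hw + p] += wv * in[ic * hw + p];
    }
}
```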