Shiyu Tang, Ting Sun, Juncai Peng, Guowei Chen, Yuying Hao, Manhui Lin, Zhihong Xiao, Jiangbin You, Yi Liu. PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices. https://arxiv.org/abs/2304.05152
- Overview
- Performance
- Reproduction
With the success of transformers in computer vision, several attempts have been made to adapt transformers to mobile devices. However, their performance is not satisfactory for some real-world applications. Therefore, we propose PP-MobileSeg, a SOTA semantic segmentation model for mobile devices.
It is composed of three newly proposed parts: the StrideFormer backbone, the Aggregated Attention Module (AAM), and the Valid Interpolate Module (VIM):
- With four-stage MobileNetV3 blocks as the feature extractor, we extract rich local features at different receptive fields with little parameter overhead. We then efficiently empower the features from the last two stages with a global view using strided SEA attention.
- To effectively fuse the features, we use the AAM to filter the detail features with ensemble voting and then add the semantic feature to them to enhance the semantic information to the greatest extent.
- At last, we use the VIM to upsample the downsampled feature to the original resolution, which significantly decreases latency at the model inference stage. It only interpolates the class channels present in the final prediction, which typically amounts to only around 10% of the classes in the ADE20K dataset. This is a common scenario for datasets with a large number of classes, so the VIM significantly decreases the latency of the final upsampling step, which takes the greatest part of the model's overall latency (see the sketch below).
Extensive experiments show that PP-MobileSeg achieves a superior params-accuracy-latency tradeoff compared to other SOTA methods.
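To make the VIM idea concrete, here is a minimal sketch in Paddle of what "only interpolate the classes present in the prediction" means. This is our illustrative code, not PaddleSeg's actual implementation; the function name `valid_interpolate` and all variable names are ours.

```python
import paddle
import paddle.nn.functional as F

def valid_interpolate(logits, out_size):
    """Sketch of the VIM idea: upsample only the class channels that
    actually appear in the low-resolution prediction."""
    # logits: [1, C, h, w] low-resolution class logits
    coarse_pred = logits.argmax(axis=1)                        # [1, h, w]
    valid = paddle.unique(coarse_pred)                         # class ids present (~10% of 150 on ADE20K)
    valid_logits = paddle.index_select(logits, valid, axis=1)  # [1, C_valid, h, w]
    up = F.interpolate(valid_logits, size=out_size, mode="bilinear")
    local_pred = up.argmax(axis=1)                             # indices into `valid`, not class ids
    # Map the local indices back to the original class ids.
    return paddle.gather(valid, paddle.flatten(local_pred)).reshape(local_pred.shape)
```

Since the bilinear upsampling and the final argmax now run over roughly a tenth of the class channels, the cost of the final upsample stage drops accordingly.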
The performance of PP-MobileSeg on the ADE20K dataset:

Model | Backbone | Training Iters | Batch Size | Train Resolution | mIoU (%) | Latency (ms)* | Params (M) | Links |
---|---|---|---|---|---|---|---|---|
PP-MobileSeg-Base | StrideFormer-Base | 80000 | 32 | 512x512 | 41.57 | 265.5 | 5.62 | config|model|log|vdl|exported model |
PP-MobileSeg-Tiny | StrideFormer-Tiny | 80000 | 32 | 512x512 | 36.39 | 215.3 | 1.61 | config|model|log|vdl|exported model |
The comparison between PP-MobileSeg and other SOTA methods on the ADE20K dataset:

Model | Backbone | mIoU (%) | Latency (ms)* | Params (M) |
---|---|---|---|---|
LR-ASPP | MobileNetV3_large_x1_0 | 33.10 | 730.9 | 3.20 |
MobileSeg-Base | MobileNetV3_large_x1_0 | 33.26 | 391.5 | 2.85 |
TopFormer-Tiny | TopTransformer-Tiny | 32.46 | 490.3 | 1.41 |
SeaFormer-Tiny | SeaFormer-Tiny | 35.00 | 459.0 | 1.61 |
PP-MobileSeg-Tiny | StrideFormer-Tiny | 36.39 | 215.3 | 1.44 |
TopFormer-Base | TopTransformer-Base | 38.28 | 480.6 | 5.13 |
SeaFormer-Base | SeaFormer-Base | 40.07** | 465.4 | 8.64 |
PP-MobileSeg-Base | StrideFormer-Base | 41.57 | 265.5 | 5.62 |
The ablation study of the proposed modules on the ADE20K dataset:

Model | Backbone | Train Resolution | mIoU (%) | Latency (ms)* | Params (M) | Links |
---|---|---|---|---|---|---|
baseline | Seaformer-Base | 512x512 | 40.00 | 465.6 | 8.27 | model|log|vdl|exported model |
+VIM | Seaformer-Base | 512x512 | 40.07 | 234.6 | 8.17 | model|log|vdl|exported model |
+VIM+StrideFormer | StrideFormer-Base | 512x512 | 40.98 | 235.1 | 5.54 | model|log|vdl|exported model |
+VIM+StrideFormer+AAM | StrideFormer-Base | 512x512 | 41.57 | 265.5 | 5.62 | model|log|vdl|exported model |
* Note that the latency is tested with the final argmax operator using PaddleLite on a Xiaomi 9 (Snapdragon 855 CPU) with a single thread and a 512x512 input shape. The output of the model is therefore the single-channel segmentation result rather than probability logits. Motivated by the observation that this final argmax operator greatly increases the overall latency, we designed the VIM to significantly decrease it.
** The accuracy is reported based on our self-trained reproduction.
- Install PaddlePaddle and the relevant environment based on the installation guide.
- Install PaddleSeg based on the reference.
- Download the ADE20K dataset and link it to PaddleSeg/data, or directly run the training script and the dataset will be downloaded automatically. The directory structure should look as follows:
PaddleSeg/data
├── ADEChallengeData2016
│ ├── ade20k_150_embedding_42.npy
│ ├── annotations
│ ├── annotations_detectron2
│ ├── images
│ ├── objectInfo150.txt
│ └── sceneCategories.txt
You can start training by running tools/train.py with a config file; the config files are under PaddleSeg/configs/pp_mobileseg. Details about training are in the training guide. After training, the best model can be found under the save directory you specify, e.g. output/pp_mobileseg_base/best_model/model.pdparams.
export CUDA_VISIBLE_DEVICES=0,1
python3 -m paddle.distributed.launch tools/train.py \
--config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
--save_dir output/pp_mobileseg_base \
--save_interval 1000 \
--num_workers 4 \
--log_iters 100 \
--use_ema \
--do_eval \
--use_vdl
With the trained model in hand, you can verify its accuracy through evaluation. Details about evaluation are in the evaluation guide.
python -m paddle.distributed.launch tools/val.py \
--config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
--model_path output/pp_mobileseg_base/best_model/model.pdparams
We deploy the model on mobile devices for inference. To do that, we need to export the model and use PaddleLite to run inference on the mobile device. You can also refer to the lite deploy guide for details of PaddleLite deployment.
- An Android phone with USB debugging mode turned on and already connected to your PC.
- Install the adb tool.
Run the following command to make sure you are ready:
adb devices
# The following information will be shown if you are good to go:
List of devices attached
017QXM19C1000664 device
The model needs to be converted from a dynamic graph to a static graph for PaddleLite inference. In this step, we can use the VIM to speed the model up: you only need to change model::upsample to vim in the config file. The exported model can be found under the save directory you specify, e.g. output/pp_mobileseg_base.
python tools/export.py \
--config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
--save_dir output/pp_mobileseg_base \
--input_shape 1 3 512 512 \
--output_op none
# --input_shape: the model is exported to infer one image with this input shape; feel free to adjust it to suit your dataset.
# --output_op: if you do not use VIM, set this to argmax to get the final prediction rather than logits.
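Before pushing the exported model to a phone, you may want to sanity-check it on the host with the Paddle Inference Python API. This is an optional check we add here, not part of the official deployment guide; the file paths assume the export command above.

```python
import numpy as np
import paddle.inference as paddle_infer

# Paths assume --save_dir output/pp_mobileseg_base from the export command above.
config = paddle_infer.Config(
    "output/pp_mobileseg_base/model.pdmodel",
    "output/pp_mobileseg_base/model.pdiparams",
)
predictor = paddle_infer.create_predictor(config)

# Feed a dummy image matching --input_shape 1 3 512 512.
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.reshape([1, 3, 512, 512])
input_handle.copy_from_cpu(np.random.rand(1, 3, 512, 512).astype("float32"))

predictor.run()
output = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()
print(output.shape)  # single-channel prediction with VIM/argmax, class logits with --output_op none
```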
- After the model is exported, download the tool zipfile and arrange it together with the exported files as shown in the following file tree.
Speed_test_dir
├── models_dir
│ ├── pp_mobileseg_base # Files under this directory are generated during export
│ │ ├── model.pdmodel
│ │ ├── model.pdiparams
│ │ ├── model.pdiparams.info
│ │ └── deploy.yaml
│ ├── pp_mobileseg_tiny
│ │ ├── model.pdmodel
│ │ ├── model.pdiparams
│ │ ├── model.pdiparams.info
│ │ └── deploy.yaml
├── benchmark_bin # The compiled test script of PaddleLite, which is included in the tool zipfile.
├── image1.txt # The txt file that stores the values of the resized and normalized image
└── gen_val_txt.py # You can use this script to generate image1.txt for your test image (a sketch is shown below)
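If you need to build image1.txt yourself, the preprocessing that gen_val_txt.py performs presumably looks like the sketch below. The exact resize and normalization must match your deploy.yaml; the ImageNet mean/std and the demo image name used here are assumptions.

```python
import numpy as np
from PIL import Image

def gen_val_txt(image_path, out_path="image1.txt", size=(512, 512)):
    """Resize and normalize an image, then dump it as a flat txt file."""
    img = Image.open(image_path).convert("RGB").resize(size)
    arr = np.asarray(img).astype("float32") / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype="float32")  # assumed ImageNet mean
    std = np.array([0.229, 0.224, 0.225], dtype="float32")   # assumed ImageNet std
    arr = (arr - mean) / std
    arr = arr.transpose(2, 0, 1)  # HWC -> CHW, matching input_shape 1,3,512,512
    np.savetxt(out_path, arr.reshape(-1), fmt="%.6f")

gen_val_txt("demo.jpg")  # hypothetical test image
```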
- Then you can test the speed of the model using the following script. The test result will be saved in test_result.txt.
sh benchmark.sh benchmark_bin models_dir test_result.txt image1.txt
The test result of our PP-MobileSeg-Base is as follows:
-----------------Model=MV3_4stage_AAMSx8_valid_0321 Threads=1-------------------------
Delete previous optimized model: /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb
---------- Opt Info ----------
Load paddle model from /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/model.pdmodel and /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/model.pdiparams
Save optimized model to /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb
---------- Device Info ----------
Brand: Xiaomi
Device: cepheus
Model: MI 9
Android Version: 9
Android API Level: 28
---------- Model Info ----------
optimized_model_file: /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb
input_data_path: /data/local/tmp/seg_benchmark/image1_norm.txt
input_shape: 1,3,512,512
output tensor num: 1
--- output tensor 0 ---
output shape(NCHW): 1 512 512
output tensor 0 elem num: 262144
output tensor 0 mean value: 1.18468e-44
output tensor 0 standard deviation: 2.52949e-44
---------- Runtime Info ----------
benchmark_bin version: e79b4b6
threads: 1
power_mode: 0
warmup: 20
repeats: 50
result_path:
---------- Backend Info ----------
backend: arm
cpu precision: fp32
---------- Perf Info ----------
Time(unit: ms):
init = 33.071
first = 314.619
min = 265.450
max = 271.217
avg = 267.246