Many software systems provide users with a set of configuration options, and different configurations may lead to different runtime performance. It is therefore necessary to understand the performance of a system under a certain configuration before the system is actually configured and deployed. This helps users make rational configuration decisions and reduces performance testing cost. As the number of possible configurations can be exponential in the number of options, it is difficult to exhaustively deploy and measure system performance under all of them. Recently, several learning methods have been proposed that build a performance prediction model from performance data collected on a small sample of configurations, and then use the model to predict system performance under a new configuration.

DeepPerf is an end-to-end deep learning based solution that trains a software performance prediction model from a limited number of samples and predicts the performance value of a software system under a new configuration. DeepPerf consists of two main stages:
- Stage 1: Tune the hyperparameters of the neural network
- Stage 2: Use the hyperparameters obtained in Stage 1 to train the neural network with the samples and predict the performance value of the software system under a new configuration.
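For intuition, the sketch below illustrates this two-stage workflow. It is not the actual DeepPerf implementation: the network width, the hyperparameter grid, and the toy data are illustrative assumptions, and the code is written against the TensorFlow 1.x API that this repository targets.

```python
# Minimal sketch of the two-stage workflow described above -- NOT the DeepPerf code.
# Stage 1 picks hyperparameters on a held-out split; Stage 2 retrains with the
# selected setting. Network width, grid values and the toy data are assumptions.
import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.x API


def build_and_train(X_tr, y_tr, X_te, y_te, n_layers, l1_weight, lr=1e-3, epochs=300):
    """Train an L1-regularized (sparse) feed-forward regressor; return relative error on X_te."""
    tf.reset_default_graph()
    x_in = tf.placeholder(tf.float32, [None, X_tr.shape[1]])
    y_in = tf.placeholder(tf.float32, [None, 1])
    reg = tf.contrib.layers.l1_regularizer(l1_weight)  # L1 penalty encourages sparse weights
    h = x_in
    for _ in range(n_layers):
        h = tf.layers.dense(h, 128, activation=tf.nn.relu, kernel_regularizer=reg)
    pred = tf.layers.dense(h, 1)
    loss = tf.losses.mean_squared_error(y_in, pred) + tf.losses.get_regularization_loss()
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(epochs):
            sess.run(train_op, {x_in: X_tr, y_in: y_tr})
        y_hat = sess.run(pred, {x_in: X_te})
    return float(np.mean(np.abs(y_hat - y_te) / np.abs(y_te)))


# Toy data standing in for measured configurations (purely illustrative).
rng = np.random.RandomState(0)
X = rng.rand(60, 10).astype(np.float32)
y = (X.sum(axis=1, keepdims=True) + 0.1 * rng.rand(60, 1)).astype(np.float32)
X_tr, y_tr, X_val, y_val = X[:45], y[:45], X[45:], y[45:]

# Stage 1: coarse search over a small hyperparameter grid using a validation split.
best_err, best_cfg = float("inf"), None
for n_layers in (2, 4):
    for l1_weight in (1e-3, 1e-2):
        err = build_and_train(X_tr, y_tr, X_val, y_val, n_layers, l1_weight)
        if err < best_err:
            best_err, best_cfg = err, (n_layers, l1_weight)

# Stage 2: retrain on all available samples with the selected hyperparameters,
# then use the resulting model to predict the performance of new configurations.
final_err = build_and_train(X, y, X_val, y_val, *best_cfg)
print("selected (layers, l1):", best_cfg, "relative error:", final_err)
```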
If you find our code useful, please cite our paper:
@inproceedings{Ha2019DeepPerf,
  author    = {Huong Ha and Hongyu Zhang},
  title     = {DeepPerf: performance prediction for configurable software with deep sparse neural network},
  booktitle = {Proceedings of the 41st International Conference on Software Engineering, {ICSE} 2019, Montreal, QC, Canada, May 25-31, 2019},
  pages     = {1095--1106},
  publisher = {{IEEE} / {ACM}},
  year      = {2019}
}
DeepPerf requires:
- Python 3.6.x
- TensorFlow (tested with TensorFlow 1.10.0 and 1.8.0)
DeepPerf can be run directly from the source code:
- Download and install Python 3.6.x.
- Install TensorFlow:
$ pip install tensorflow==1.10.0
- Clone DeepPerf:
$ git clone https://github.com/DeepPerf/DeepPerf.git
DeepPerf has been evaluated on 11 real-world configurable software systems:
- Apache
- LLVM
- x264
- BDBC
- BDBJ
- SQL
- Dune
- hipacc
- hsmgp
- javagc
- sac
Six of these systems have only binary configuration options, while the other five have both binary and numeric configuration options. The data is stored in the DeepPerf\Data directory. These software systems were measured and published online by the SPLConqueror team. More information about these systems and how they were measured can be found here.
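As a quick way to inspect one of these datasets, the sketch below loads a file from the Data directory with pandas. The file name used here is an assumption; check the Data directory for the actual file names.

```python
# Quick look at one of the measured datasets (the file name below is an
# assumption; see the Data directory for the actual file names).
import pandas as pd

df = pd.read_csv("Data/Apache_AllNumeric.csv")   # hypothetical path/name
print(df.shape)             # rows = measured configurations, columns = options + performance
print(df.columns.tolist())  # configuration options followed by the performance metric
print(df.head())
```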
To run DeepPerf, users need to specify the name of the software system they wish to evaluate and then run the script AutoDeepPerf.py. There are 11 software systems that users can evaluate: Apache, LLVM, x264, BDBC, BDBJ, SQL, Dune, hipacc, hsmgp, javagc, sac. The script will then evaluate DeepPerf on the chosen software system with the same experiment setup presented in our paper. Specifically, for binary software systems, DeepPerf will run with five different sample sizes: n, 2n, 3n, 4n, 5n, with n being the number of options, and 30 experiments for each sample size. For binary-numeric software systems, DeepPerf will run with the sample sizes specified in Table IV of our paper, and 30 experiments for each sample size. For example, if users want to evaluate DeepPerf with the system LLVM, the command line to run DeepPerf will be:
$ python AutoDeepPerf.py LLVM
After finishing each sample size, the script outputs a .csv file showing the mean prediction error and the margin (95% confidence interval) of that sample size over the 30 experiments. These results should be the same as or similar to the results we report in Tables III and IV of our paper.
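For reference, a margin of this kind can be computed from the per-experiment errors as the half-width of a 95% confidence interval. The sketch below uses the Student-t interval, which is an assumption about the exact formula the script uses; the error values are stand-ins.

```python
# Sketch: mean and 95% confidence-interval margin over repeated experiments.
# Using the Student-t interval here is an assumption, not necessarily the exact
# formula used by AutoDeepPerf.py.
import numpy as np
from scipy import stats

errors = np.random.RandomState(1).normal(10.0, 1.5, size=30)  # stand-in for 30 MREs (%)
mean = errors.mean()
margin = stats.t.ppf(0.975, df=len(errors) - 1) * errors.std(ddof=1) / np.sqrt(len(errors))
print(f"Mean = {mean:.2f}, Margin = {margin:.2f}")
```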
Alternatively, users can customize the sample size and/or the number of experiments for each sample size by using the optional arguments -ss and -ne. For example, to set the sample size to 20 and the number of experiments to 10, the corresponding command line is:
$ python AutoDeepPerf.py LLVM -ss 20 -ne 10
If neither or only one option is set, the other option(s) will run with the default setting. The default number of experiments is 30. The default sample sizes are: (a) the five sample sizes n, 2n, 3n, 4n, 5n, with n being the number of configuration options, when the evaluated system is a binary system, or (b) the four sample sizes specified in Table IV of our paper when the evaluated system is a binary-numeric system.
NOTE: The time cost of tuning hyperparameters and training the final neural network for each experiment ranges from 2 to 20 minutes, depending on the software system, the sample size and the user's CPU. Typically, the time cost is smaller when the software system has fewer configuration options or when the sample size is small. Therefore, please be aware that for each sample size, the time cost of evaluating 30 experiments ranges from 1 hour to 10 hours.
To evaluate the prediction accuracy, we use the mean relative error (MRE), which is computed as

MRE = (1 / |V|) * Σ_{c ∈ V} (|predicted_c - actual_c| / actual_c) × 100%,

where V is the testing dataset, predicted_c is the predicted performance value of configuration c generated using the model, and actual_c is the actual performance value of configuration c. In the two tables below, Mean is the mean of the MREs seen in 30 experiments and Margin is the margin of the 95% confidence interval of the MREs in the 30 experiments. The results are obtained when evaluating DeepPerf on a Windows 7 computer with an Intel Xeon E5-1650 3.2GHz CPU and 16GB RAM.
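The definition above translates directly into code; the configuration values below are illustrative only.

```python
# Direct translation of the MRE definition above (values are illustrative).
import numpy as np

actual = np.array([12.0, 8.5, 20.1])      # measured performance of the test configurations
predicted = np.array([11.4, 9.0, 18.9])   # model predictions for the same configurations
mre = np.mean(np.abs(predicted - actual) / actual) * 100
print(f"MRE = {mre:.2f}%")
```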
| Subject System | Sample Size | DECART Mean | DECART Margin | DeepPerf Mean | DeepPerf Margin |
|---|---|---|---|---|---|
| Apache | n | NA | NA | 17.87 | 1.85 |
| | 2n | 15.83 | 2.89 | 10.24 | 1.15 |
| | 3n | 11.03 | 1.46 | 8.25 | 0.75 |
| | 4n | 9.49 | 1.00 | 6.97 | 0.39 |
| | 5n | 7.84 | 0.28 | 6.29 | 0.44 |
| x264 | n | 17.71 | 3.87 | 10.43 | 2.28 |
| | 2n | 9.31 | 1.30 | 3.61 | 0.54 |
| | 3n | 6.37 | 0.83 | 2.13 | 0.31 |
| | 4n | 4.26 | 0.47 | 1.49 | 0.38 |
| | 5n | 2.94 | 0.52 | 0.87 | 0.11 |
| BDBJ | n | 10.04 | 4.67 | 7.25 | 4.21 |
| | 2n | 2.23 | 0.16 | 2.07 | 0.32 |
| | 3n | 2.03 | 0.16 | 1.73 | 0.12 |
| | 4n | 1.72 | 0.09 | 1.67 | 0.12 |
| | 5n | 1.67 | 0.09 | 1.61 | 0.09 |
| LLVM | n | 6.00 | 0.34 | 5.09 | 0.80 |
| | 2n | 4.66 | 0.47 | 3.87 | 0.48 |
| | 3n | 3.96 | 0.39 | 2.54 | 0.15 |
| | 4n | 3.54 | 0.42 | 2.27 | 0.16 |
| | 5n | 2.84 | 0.33 | 1.99 | 0.15 |
| BDBC | n | 151.0 | 90.70 | 133.6 | 54.33 |
| | 2n | 43.8 | 26.72 | 16.77 | 2.25 |
| | 3n | 31.9 | 22.73 | 13.1 | 3.39 |
| | 4n | 6.93 | 1.39 | 6.95 | 1.11 |
| | 5n | 5.02 | 1.69 | 5.82 | 1.33 |
| SQL | n | 4.87 | 0.22 | 5.04 | 0.32 |
| | 2n | 4.67 | 0.17 | 4.63 | 0.13 |
| | 3n | 4.36 | 0.09 | 4.48 | 0.08 |
| | 4n | 4.21 | 0.1 | 4.40 | 0.14 |
| | 5n | 4.11 | 0.08 | 4.27 | 0.13 |
| Subject System | Sample Size | SPLConqueror Sampling Heuristic | SPLConqueror Mean | DeepPerf Sampling Heuristic | DeepPerf Mean | DeepPerf Margin |
|---|---|---|---|---|---|---|
| Dune | 49 | OW RD | 20.1 | RD | 15.73 | 0.90 |
| | 78 | PW RD | 22.1 | RD | 13.67 | 0.82 |
| | 240 | OW PBD(49, 7) | 10.6 | RD | 8.19 | 0.34 |
| | 375 | OW PBD(125, 5) | 18.8 | RD | 7.20 | 0.17 |
| hipacc | 261 | OW RD | 14.2 | RD | 9.39 | 0.37 |
| | 528 | OW PBD(125, 5) | 13.8 | RD | 6.38 | 0.44 |
| | 736 | OW PBD(49, 7) | 13.9 | RD | 5.06 | 0.35 |
| | 1281 | PW RD | 13.9 | RD | 3.75 | 0.26 |
| hsmgp | 77 | OW RD | 4.5 | RD | 6.76 | 0.87 |
| | 173 | PW RD | 2.8 | RD | 3.60 | 0.2 |
| | 384 | OW PBD(49, 7) | 2.2 | RD | 2.53 | 0.13 |
| | 480 | OW PBD(125, 5) | 1.7 | RD | 2.24 | 0.11 |
| javagc | 423 | OW PBD(49, 7) | 37.4 | RD | 24.76 | 2.42 |
| | 534 | OW RD | 31.3 | RD | 23.27 | 4.00 |
| | 855 | OW PBD(125, 5) | 21.9 | RD | 21.83 | 7.07 |
| | 2571 | OW PBD(49, 7) | 28.2 | RD | 17.32 | 7.89 |
| sac | 2060 | OW RD | 21.1 | RD | 15.83 | 1.25 |
| | 2295 | OW PBD(125, 5) | 20.3 | RD | 17.95 | 5.63 |
| | 2499 | OW PBD(49, 7) | 16 | RD | 17.13 | 2.22 |
| | 3261 | PW RD | 30.7 | RD | 15.40 | 2.05 |