This repository contains the implementation of the Regression Analysis and Workload Characterization Components of RAST.
- The Training Database: refers to the SQLite file produced by the ETL component of RAST. It contains information about the requests processed by the System under Test (SUT), such as timestamp, parallel requests executed, request time, and processing time.
- Requests_per_time_unit_*.log: [TODO]
- `DataFrame`: refers to the DataFrame class provided by the pandas library.
- scikit-learn: is the library we use for machine learning.
- Reads the training data from the database into a `DataFrame`.
- Performs outlier detection and removal.
- Splits the training data into train and test subsets.
- Performs cross-validation using a series of different scikit-learn estimators.
- Fits a specific estimator using the train subset and then evaluates the performance of the estimator using the test subset.
- Produces the predictive model and request type mapping by exporting the estimator and the dictionary containing the mapping.
- Stores the model in the `regression_analysis_results` folder in the project. Each exported model is placed in a separate folder.
Additionally, the script contains commented-out code for visualizing the training data with matplotlib or plotly.
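The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code of `RegressionAnalysis.py`: the table name `training_data`, the column names, the z-score outlier rule, the reading of `DT` as a decision tree, the example mapping, and the use of `pickle` for export are all assumptions, and the data is synthetic.

```python
import pickle
import sqlite3

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

FEATURES = ["parallel_requests", "request_type"]  # assumed column names
TARGET = "processing_time"                        # assumed column name


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory SQLite database with synthetic rows (not real RAST data)."""
    rng = np.random.default_rng(0)
    n = 200
    parallel = rng.integers(1, 20, size=n)
    req_type = rng.integers(0, 3, size=n)
    proc_time = 0.05 * parallel + 0.1 * req_type + rng.normal(0, 0.01, size=n)
    proc_time[:3] = 50.0  # inject three obvious outliers
    frame = pd.DataFrame(
        {"parallel_requests": parallel, "request_type": req_type, "processing_time": proc_time}
    )
    conn = sqlite3.connect(":memory:")
    frame.to_sql("training_data", conn, index=False)  # hypothetical table name
    return conn


def remove_outliers(df: pd.DataFrame, column: str, z_max: float = 3.0) -> pd.DataFrame:
    """Drop rows whose z-score in `column` exceeds z_max (one simple rule among many)."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= z_max]


def run_pipeline(conn: sqlite3.Connection):
    # Read the training data into a DataFrame and remove outliers.
    df = pd.read_sql_query("SELECT * FROM training_data", conn)
    df = remove_outliers(df, TARGET)

    # Split into train and test subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[TARGET], test_size=0.25, random_state=0
    )

    # Cross-validate several estimators on the train subset (mean R^2 over 5 folds).
    candidates = {"Ridge": Ridge(), "DT": DecisionTreeRegressor(random_state=0)}
    cv_scores = {
        name: cross_val_score(est, X_train, y_train, cv=5).mean()
        for name, est in candidates.items()
    }

    # Fit one estimator and evaluate it on the held-out test subset.
    model = candidates["Ridge"].fit(X_train, y_train)
    test_r2 = model.score(X_test, y_test)

    # Export the estimator together with a (hypothetical) request type mapping.
    mapping = {"login": 0, "search": 1, "checkout": 2}
    exported = pickle.dumps({"model": model, "request_type_mapping": mapping})
    return len(df), cv_scores, test_r2, exported
```

Calling `run_pipeline(build_demo_db())` returns the number of rows kept after outlier removal, the cross-validation scores per estimator, the test score, and the serialized model bundle. In a real run the bundle would be written to a file instead of being kept in memory.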
RegressionAnalysis.py [OPTIONS] [DATABASE_PATH] [ESTIMATOR_TO_USE]
- `[DATABASE_PATH]`: Path to the training database to load. Default: `db/trainingdata_cumulative.db`
- `[ESTIMATOR_TO_USE]`: Estimator to use. Can be `Ridge` or `DT`. Default: `Ridge`
- `--help`: Show this message and exit.
After a training database is created, it is placed in the `db` folder of the ML_ETL project by default. Run this command to create a predictive model for this specific training database:
python RegressionAnalysis.py ../ML_ETL/db/trainingdata_2024-05-24.db
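For readers who want to see how two optional positional arguments with defaults behave, here is a small sketch using `argparse` from the standard library. It is a stand-in for illustration only; the actual script may declare its interface with a different library, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented arguments: both positionals are optional and have defaults."""
    parser = argparse.ArgumentParser(prog="RegressionAnalysis.py")
    parser.add_argument(
        "database_path",
        nargs="?",  # optional positional argument
        default="db/trainingdata_cumulative.db",
        help="Path to the training database to load.",
    )
    parser.add_argument(
        "estimator_to_use",
        nargs="?",
        default="Ridge",
        choices=["Ridge", "DT"],
        help="Estimator to use.",
    )
    return parser
```

With no arguments, `build_parser().parse_args([])` falls back to both defaults. Because the arguments are positional, the estimator can only be given after a database path.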
- Reads requests_per_time_unit_*.logs into two `DataFrames` containing the extracted workload patterns.
- Calculates average and median requests per hour and requests per day.
- Visualizes the daily workload and the median number of requests per day using plotly.
- Exports the plots as .pdf files.
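The average and median calculation above might look like the following sketch. The log file format is not documented here, so the example starts from a `DataFrame` that is already parsed; the column names `timestamp` and `requests` and the function name `summarize_workload` are assumptions.

```python
import pandas as pd


def summarize_workload(requests: pd.DataFrame) -> dict:
    """Aggregate request counts per hour and per day, then take mean and median.

    Expects a `timestamp` column (datetime) and a `requests` column (count);
    both names are assumed for illustration.
    """
    series = requests.set_index("timestamp")["requests"].sort_index()
    per_hour = series.resample(pd.Timedelta(hours=1)).sum()
    per_day = series.resample(pd.Timedelta(days=1)).sum()
    return {
        "mean_per_hour": float(per_hour.mean()),
        "median_per_hour": float(per_hour.median()),
        "mean_per_day": float(per_day.mean()),
        "median_per_day": float(per_day.median()),
    }


# Tiny synthetic example spanning two days.
demo = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-05-24 00:10", "2024-05-24 00:50", "2024-05-24 02:00", "2024-05-25 01:00"]
        ),
        "requests": [10, 20, 30, 40],
    }
)
stats = summarize_workload(demo)
```

Note that `resample` creates a bin for every hour between the first and last timestamp, so hours without any requests count as zero and pull the hourly statistics down. Whether idle hours should be included is a design decision that depends on what the workload model is meant to capture.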
- Reads the training data from the database into a `DataFrame`.
- Calculates and prints out the list of:
  - different request types and the total number of each request type found in the training data;
  - the 25 most processed requests;
  - the number of different requests.
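These three figures map directly onto pandas operations. The sketch below assumes the request type is stored in a column named `request_type`; that name, the function name `characterize_requests`, and the sample values are hypothetical.

```python
import pandas as pd


def characterize_requests(df: pd.DataFrame, column: str = "request_type", top_n: int = 25):
    """Return per-type counts, the top_n most frequent types, and the number of distinct types."""
    counts = df[column].value_counts()  # sorted by frequency, descending
    most_processed = counts.head(top_n)
    distinct = int(df[column].nunique())
    return counts, most_processed, distinct


# Synthetic sample with made-up request type names.
sample = pd.DataFrame(
    {"request_type": ["login", "search", "login", "checkout", "login", "search"]}
)
counts, most_processed, distinct = characterize_requests(sample, top_n=2)
```

Here `counts` lists every request type with its total, `most_processed` keeps only the two most frequent types, and `distinct` is the number of different request types.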
Usage: WorkloadCharacterization.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
use-db Uses the db as the datasource for workload characterization.
use-files Uses the requests_per_time_unit_*.logs as the datasource for
workload characterization.
Compares two databases produced by the ETL for their similarity. The use case for this script is to evaluate the quality of the Simulator component of RAST by comparing the database used for training with a database that was produced based on the logs of the Simulator.
Contain general functions commonly used by the other scripts, like reading the contents of the Training Database and performing outlier detection.