diff --git a/README.md b/README.md index 5bb775c..2c2a443 100644 --- a/README.md +++ b/README.md @@ -2,8 +2,9 @@ # ECNet: Large scale machine learning projects for fuel property prediction -[![status](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f/status.svg)](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f) +[![GitHub version](https://badge.fury.io/gh/tjkessler%2Fecnet.svg)](https://badge.fury.io/gh/tjkessler%2Fecnet) [![PyPI version](https://badge.fury.io/py/ecnet.svg)](https://badge.fury.io/py/ecnet) +[![status](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f/status.svg)](http://joss.theoj.org/papers/f556afbc97e18e1c1294d98e0f7ff99f) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/TJKessler/ECNet/master/LICENSE.txt) **ECNet** is an open source Python package for creating large scale machine learning projects with a focus on fuel property prediction. A __project__ is considered a collection of __builds__, and each build is a collection of __nodes__. Nodes are averaged to obtain a final predicted value for the build. For each node in a build, multiple neural networks are constructed and the best performing neural network is used as that node's predictor. Using multiple nodes allows a build to learn from multiple learning and validation sets, reducing the build's error. Projects can be saved and reused. @@ -12,9 +13,6 @@ Using ECNet, [T. Kessler et al.](https://doi.org/10.1016/j.fuel.2017.06.015) have increased the generalizability of ANN's to predict the cetane numbers for molecules from a variety of molecular classes represented in the [cetane number database](https://github.com/TJKessler/ECNet/tree/master/databases), and have increased the accuracy of ANN's for predicting the cetane numbers for molecules from underrepresented molecular classes through targeted database expansion. -Here is a visual represntation of a build for cetane number prediction: -![Build Diagram](https://github.com/TJKessler/ECNet/blob/master/misc/build_figure.png) - # Installation: ### Prerequisites: @@ -34,7 +32,7 @@ Note: if multiple Python releases are installed on your system (e.g. 2.7 and 3.5 - Download the ECNet repository, navigate to the download location on the command line/terminal, and execute **"python setup.py install"**. -Additional package dependencies (TensorFlow, PyYaml) will be installed during the ECNet installation process. +Additional package dependencies (TensorFlow, PyYaml, ecabc, PyGenetics) will be installed during the ECNet installation process. To update your version of ECNet to the latest release version, use "**pip install --upgrade ecnet**". 
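As a quick post-install check (a minimal sketch, assuming the v1.4.0 release described by this changeset and its dependencies installed cleanly), importing the package and constructing a Server object should succeed:

```python
# Minimal installation check; assumes ECNet v1.4.x is installed
import ecnet
from ecnet.server import Server

print(ecnet.__version__)    # e.g. '1.4.0'
sv = Server()               # constructing a Server confirms the package and its dependencies load
```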
@@ -54,7 +52,7 @@ data_split: - 0.65 - 0.25 - 0.1 -learning_rate: 0.05 +learning_rate: 0.1 mlp_hidden_layers: - - 32 - relu @@ -62,16 +60,13 @@ mlp_hidden_layers: - relu mlp_in_layer_activ: relu mlp_out_layer_activ: linear -normals_use: false project_name: my_project -project_num_builds: 1 -project_num_nodes: 1 -project_num_trials: 5 +project_num_builds: 10 +project_num_nodes: 5 +project_num_trials: 75 project_print_feedback: true -train_epochs: 2500 -valid_max_epochs: 7500 -valid_mdrmse_memory: 250 -valid_mdrmse_stop: 0.00007 +train_epochs: 500 +valid_max_epochs: 15000 ``` Here are brief explanations of each of these variables: @@ -84,7 +79,6 @@ Here are brief explanations of each of these variables: - Rectified linear unit (**'relu'**), **'sigmoid'**, and **'linear'** *layer_type*s are currently supported - **mlp_in_layer_activ** - the layer type of the input layer: number of nodes is determined by data dimensions - **mlp_out_layer_activ** - the layer type of the output layer: number of nodes is determined by data dimensions -- **normals_use**: *boolean* to determine if I/O parameters should be normalized (min-max, between 0 and 1) - **project_name**: the name of your project - **project_num_builds**: the number of builds in your project - **project_num_nodes**: the number of nodes in each build @@ -92,49 +86,55 @@ Here are brief explanations of each of these variables: - **project_print_feedback**: whether the console will show status messages - **train_epochs**: number of training iterations (not used with validation) - **valid_max_epochs**: the maximum number of training iterations during the validation process -- **valid_mdrmse_memory**: how many epochs back the validation process looks in determining the change in validation RMSE over time -- **valid_mdrmse_stop**: the threshold to determine learning cutoff (looks at the change in validation RMSE over time) ## Server methods: Here is an overview of the Server object's methods: -- **create_save_env()**: creates the folder hierarchy for your project, contained in a folder named after your project name - - note: if this is not done, a project will not be created, and single models will be saved to the 'tmp' folder in your working directory -- **import_data(*data_filename = None*)**: imports the data from the database specified in 'data_filename', splits the data into learning/validation/testing groups, and packages the data so it's ready to be sent to the model +- **create_project(*project_name = None*)**: creates the folder hierarchy for your project + - project_name values: + - **None** (default config project name is used) + - **Other** (supplied project name is used for your project) + - note: if this is not called, a project will not be created, and single models will be saved to the 'tmp' folder in your working directory +- **import_data(*data_filename = None*)**: imports the data from the database specified in *data_filename*, splits the data into learning/validation/testing groups, and packages the data so it's ready to be sent to the model - data_filename values: - **None** (default config filename is used) - - **'database_path_string.csv'** (specified database at location is used) -- **fit_mlp_model(*args*)**: fits multilayer-perceptron(s) to the data, for 'train_epochs' learning iterations - - arguments: - - **None** (no re-shuffling between trials) - - **'shuffle_lv'** (shuffles learning and validation sets between trials) - - **'shuffle_lvt'** (shuffles all sets between trials) -- **fit_mlp_model_validation(*args*)**: fits 
multilayer-perceptron(s) to the data, using the validation set to determine when to stop learning + - **Other** (supplied CSV database will be used) +- **limit_parameters(*limit_num, output_filename, use_genetic = False, population_size = 100, num_survivors = 33, num_generations = 10*)**: reduces the input dimensionality of an input database to *limit_num* through a "retain the best" algorithm; saves the limited database to *output_filename*. If *use_genetic* is True, a genetic algorithm will be used instead of the retention algorithm; optional arguments for the genetic algorithm are: + - **population_size** (size of the genetic algorithm population) + - **num_survivors** (number of population members used to generate the next generation) + - **num_generations** (number of generations the genetic algorithm runs for) +- **tune_hyperparameters(*target_score = None, iteration_amt = 50, amt_employers = 50*)**: optimizes neural network hyperparameters (learning rate, maximum epochs during validation, neuron counts for each hidden layer) using an artificial bee colony algorithm; a usage sketch for *limit_parameters* and *tune_hyperparameters* follows this method overview + - arguments: + - **target_score** (specify target score for program to terminate) + - If *None*, ABC will run for *iteration_amt* iterations + - **iteration_amt** (specify how many iterations to run the colony) + - Only if *target_score* is not supplied + - **amt_employers** (specify the number of employer bees in the colony) +- **train_model(*args, validate = False*)**: fits neural network(s) to the imported data - arguments: - **None** (no re-shuffling between trials) - **'shuffle_lv'** (shuffles learning and validation sets between trials) - **'shuffle_lvt'** (shuffles all sets between trials) -- **select_best(*args*)**: selects the best performing model (lowest RMSE on specified data set) to represent each node of each build; requires a folder hierarchy to be created - - arguments: + - If validate is **True**, the data's validation set will be used to periodically test model performance to determine when to stop learning; else, trains for *train_epochs* iterations +- **select_best(*dset = None, error_fn = 'rmse'*)**: selects the best performing neural network to represent each node of each build; requires a folder hierarchy to be created + - dset arguments: - **None** (best performers are based on entire database) - **'learn'** (best performers are based on learning set) - **'valid'** (best performers are based on validation set) - **'train'** (best performers are based on learning & validation sets) - **'test'** (best performers are based on test set) -- **use_mlp_model(*args*)**: predicts values for a specified data set; returns a list of results for each build - - arguments: + - error_fn arguments: + - **'rmse'** (RMSE is used as the metric to determine best performing neural network) + - **'mean_abs_error'** (Mean absolute error is used as the metric to determine best performing neural network) + - **'med_abs_error'** (Median absolute error is used as the metric to determine best performing neural network) +- **use_model(*dset = None*)**: predicts values for a specified data set; returns a list of results for each build + - dset arguments: - **None** (defaults to whole dataset) - **'learn'** (obtains results for learning set) - **'valid'** (obtains results for validation set) - **'train'** (obtains results for learning & validation sets) - **'test'** (obtains results for test set) -- **tune_hyperparameters(*target_score = None, iteration_amount = 50, amount_of_employers = 50*)**: optimize the hyperparameters - - 
argumnets: - - **None** (defaults to 50 iterations, 50 employers) - - **iteration_amount** (specify how many iterations to run the colony) - - **target_score** (specify target score for program to terminate) - - **amount_of_employers** (specify the amount of employer bees in the colony) - **calc_error(*args*, *dset = None*)**: calculates various metrics for error for a specified data set - arguments: - **'rmse'** (root-mean-squared error) @@ -147,16 +147,9 @@ Here is an overview of the Server object's methods: - **'valid'** (errors for validation set) - **'train'** (errors for learning & validation sets) - **'test'** (errors for test set) -- **output_results(*results, filename, dset = None*)**: saves your **results** to a specified output **filename** - - dset values: - - **None** (defaults to outputting entire dataset) - - **'learn'** (outputs data for learning set) - - **'valid'** (outputs data for validation set) - - **'train'** (outputs data for learning & validation sets) - - **'test'** (outputs data for test set) -- **limit_parameters(*param_num, filename*)**: reduces the input dimensionality of an input database to **param_num** through a "retain the best" process; saves to new database **filename** -- **publish_project()**: cleans the project directory, copies config, normal_params, and currently loaded dataset into the directory, and creates a '.project' file -- **open_project(*project_name*)**: opens a '**project_name**.project' file, importing the project's config, normal_params, and dataset to the Server object +- **output_results(*results, filename = 'my_results.csv'*)**: saves your **results** to a specified output **filename** +- **save_project(*clean_up = True*)**: cleans the project directory if *clean_up* is True (removing neural networks not selected via *select_best()*), copies the config and currently loaded dataset into the directory, and creates a '.project' file +- **open_project(*filename*)**: opens a '*filename*.project' file, importing the project's config, dataset, and trained models to the Server object Working directly with the Server object to handle model creation and data management allows for speedy scripting, but you can still work with the model and data classes directly. View the source code [README.md](https://github.com/TJKessler/ECNet/tree/master/ecnet) for more information on low-level usage. 
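The sketch below (referenced in the *tune_hyperparameters* entry above) chains *import_data*, *limit_parameters* and *tune_hyperparameters* using the signatures documented in this overview; the database filename and numeric arguments are placeholders, not recommended settings:

```python
from ecnet.server import Server

sv = Server()
sv.import_data('my_data.csv')

# Reduce the database to its 15 most influential inputs with the genetic algorithm,
# saving the limited database to a new file (argument values are illustrative)
sv.limit_parameters(15, 'my_data_limited.csv', use_genetic = True,
                    population_size = 50, num_survivors = 15, num_generations = 10)

# Tune the learning rate, validation epoch cutoff and hidden layer sizes with the ABC algorithm
sv.tune_hyperparameters(iteration_amt = 25, amt_employers = 50)
```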
@@ -172,33 +165,30 @@ from ecnet.server import Server # Create server object sv = Server() -# Create a folder structure for your project -sv.create_save_env() - -# Import data from file specified in config -sv.import_data() +# Create a project +sv.create_project('my_new_project') -# Fits model(s), shuffling learn and validate sets between trials -sv.fit_mlp_model_validation('shuffle_lv') +# Import data +sv.import_data('my_data.csv') -# Tunes hyperparameters to their optimal values -sv.tune_hyperparameters(iteration_amount = 150) +# Trains neural networks using periodic validation, shuffling learn and validate sets between trials +sv.train_model('shuffle_lv', validate = True) -# Select best trial from each build node to predict for the node -sv.select_best() +# Select best neural network from each build node (based on test set performance) to predict for the node +sv.select_best('test') # Predict values for the test data set test_results = sv.use_mlp_model('test') # Output results to specified file -sv.output_results(results = test_results, filename = 'test_results.csv', dset = 'test') +sv.output_results(results = test_results, filename = 'test_results.csv') # Calculates errors for the test set test_errors = sv.calc_error('rmse','r2','mean_abs_error','med_abs_error', dset = 'test') print(test_errors) -# Publish the project to a .project file -sv.publish_project() +# Save the project to a .project file +sv.save_project() ``` @@ -216,7 +206,7 @@ sv.vars['mlp_hidden_layers'] = [[32, 'relu'], [32, 'relu']] sv.vars['project_print_feedback'] = False ``` -Once you publish a project, the .project file can be opened and used for predictions: +Once you save a project, the .project file can be opened and used for predictions: ```python from ecnet.server import Server @@ -230,14 +220,14 @@ sv.open_project('my_project.project') sv.import_data('new_data.csv') # Save results to output file -# - NOTE: no 'dset' arguments for 'use_mlp_model' and 'output_results' defaults to using all data -sv.output_results(results = sv.use_mlp_model(), filename = 'new_data_results.csv') +# - NOTE: no 'dset' argument for 'use_model' defaults to using all currently loaded data +sv.output_results(results = sv.use_model(), filename = 'new_data_results.csv') ``` To view more examples of common ECNet tasks, view the [examples](https://github.com/TJKessler/ECNet/tree/master/examples) directory. ## Database Format: -ECNet databases are comma-separated value (CSV) formatted files that provide information such as the ID of each molecule (DATAid), an optional explicit sort type (T/V/L), various strings and groups to identify molecules, and input and output/target parameters. The number of strings, groups, outputs/targets and specific DATAID's to automatically drop are determined by the master parameters in rows 1-2. Row 3 contains the headers for each sub-section (DATAID, T/V/L, strings, groups, paramters), and row 4 contains specific string, group, and parameter names. The number of outputs/targets determined by the "Num of Outputs" master parameter tells the data importing software how many parameter columns (from left to right) are used as outputs/targets. +ECNet databases are comma-separated value (CSV) formatted files that provide information such as the ID of each molecule, an optional explicit sort type, various strings and groups to identify molecules, and output/target and input parameters. Row 1 is used to identify which columns are used for ID, sorting assignment, various strings and groups, and target and input data. 
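As a loose illustration of that layout (the column names and values below are hypothetical; consult the template in the databases directory for the authoritative header tags), a database might begin:

```
DATAID,ASSIGNMENT,STRING,GROUP,TARGET,INPUT,INPUT
DATAID,ASSIGNMENT,Compound Name,Class,Cetane Number,Descriptor 1,Descriptor 2
mol_0001,L,n-heptane,n-paraffin,56.0,0.135,1.240
mol_0002,T,toluene,aromatic,7.4,0.226,0.980
```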
The [databases](https://github.com/TJKessler/ECNet/tree/master/databases) directory contains databases for cetane number as well as a database template. @@ -247,4 +237,4 @@ To contribute to ECNet, make a pull request. Contributions should include tests To report problems with the software or feature requests, file an issue. When reporting problems, include information such as error messages, your OS/environment and Python version. -For additional support/questions, contact Travis Kessler (Travis_Kessler@student.uml.edu), Hernan Gelaf-Romer (hernan_gelafromer@student.uml.edu) or John Hunter Mack (Hunter_Mack@uml.edu). +For additional support/questions, contact Travis Kessler (travis.j.kessler@gmail.com), Hernan Gelaf-Romer (hernan_gelafromer@student.uml.edu) and/or John Hunter Mack (Hunter_Mack@uml.edu). diff --git a/ecnet/README.md b/ecnet/README.md index 9909e96..16fe4c8 100644 --- a/ecnet/README.md +++ b/ecnet/README.md @@ -1,50 +1,37 @@ # Low-level usage of model, data_utils, error_utils, limit_parameters, and abc ## model.py -#### Class: multilayer_perceptron +#### Class: MultilayerPerceptron Attributes: - **layers**: list of layers; layers contain information about the number of neurons and the activation function in form [num, func] - **weights**: list of TensorFlow weight variables - **biases**: list of TensorFlow bias variables Methods: -- **addLayer(num_neurons, activ_function)**: adds a layer to layers in form [num_neurons, activ_function] +- **add_layer(size, act_fn)**: appends a *Layer* to the MLP's layer list - supported activation functions: 'relu', 'sigmoid', 'linear' -- **connectLayers()**: initializes TensorFlow variables for weights and biases between each layer; fully connected -- **feed_forward(x)**: used by TensorFlow graph to feed data through weights and add biases -- **fit(x_l, y_l, learning_rate, train_epochs)**: fits the model to the inputs (**x_l**) and outputs (**y_l**) for **train_epochs** iterations with a learning rate of **learning_rate** -- **fit_validation(x_l, x_v, y_l, y_v, learning_rate, mdrmse_stop, mdrmse_memory, max_epochs)**: fits the model while using a validation set in order to test the learning performance over time - - *mdrmse*: mean-delta-root-mean-squared error, or the change in the difference between RMSE values over time - - **mdrmse_stop** is the cutoff point, where the function ceases learning (mdrmse approaches zero as epochs increases) - - **mdrmse_memory** is used to determine how far back (number of epochs) the function looks in determining mdrmse +- **connect_layers()**: initializes TensorFlow variables for weights and biases between each layer; fully connected +- **fit(x_l, y_l, learning_rate, train_epochs)**: fits the MLP to the inputs (**x_l**) and outputs (**y_l**) for **train_epochs** iterations with a learning rate of **learning_rate** +- **fit_validation(x_l, x_v, y_l, y_v, learning_rate, max_epochs)**: fits the MLP, periodically checking MLP performance using validation data; learning is stopped when validation data performance stops improving - **max_epochs** is the cutoff point if mdrmse has not fallen below mdrmse_stop -- **test_new(x)**: used to pass data through the model to get a prediction, without training it; returns predicted values -- **save_net(output_filepath)**: saves the TensorFlow session (.sess) and model architecture information (.struct) to specified filename -- **load_net(model_load_filename)**: opens a TensorFlow session (.sess) and model architecture information (.struct) to work with -- 
**export_weights()**: returns numerical versions of the model's TensorFlow weight variables -- **export_biases()**: returns numerical versions of the model's TensorFlow bias variables - -Misc. Functions: -- **calc_valid_rmse(x, y)**: calculates the root-mean-squared error during 'fit_validation()' +- **use(x)**: used to pass data through the trained model to get a prediction; returns predicted values +- **save(filepath)**: saves the TensorFlow session (.sess) and model architecture information (.struct) to specified filename +- **load(filepath)**: opens a TensorFlow session (.sess) and model architecture information (.struct) from specified filename ## data_utils.py -#### Class: initialize_data(data_filename) +#### Class: DataFrame Methods: -- **build()**: imports a formatted database, parses controls, sets up groupings and data I/O locations -- **normalize(param_filepath)**: will normalize the input and output data to [0,1] using min-max normalization; saves a file containing normalization parameters -- **apply_normal(param_filepath)**: will apply the normalization paramters found in the specified file to un-normalized data -- **buildTVL(sort_type, data_split)**: builds the test, validation and learn sets using 'random' or 'explicit' sort types - - supported sort types: 'random', 'explicit' - - data_split format: [0.L, 0.V, 0.T] - sum of 0.L, 0.V and 0.T = 1 -- **randomizeData(randIndex, data_split)**: used by 'buildTVL' to randomly assign test, validation and learn indices, which will be applied to each data input -- **applyTVL()**: applies the test, validation and learn indices to each data input -- **package()**: packages the data, so it can be handed to a machine learning model +- **__init__(filename)**: imports a formatted database, creates DataPoints for each data entry, and collects string, group, target and input names and counts +- **create_sets(random = True, split = [0.7, 0.2, 0.1])**: creates learning, validation and testing sets with *split* proportions; if random = False, database set assignments are used +- **create_sorted_sets(sort_string, split = [0.7, 0.2, 0.1])**: using *sort_string*, a string contained in the given database, assigns proportions *split* of each possible string value to learning, validation and testing sets +- **shuffle(args, split = [0.7, 0.2, 0.1])**: shuffles data for specified sets + - args combinations: + - 'l, v, t' (shuffles data for learning, validation and testing sets) + - 'l, v' (shuffles data for learning and validation sets) +- **package_sets()**: returns a PackagedData object, containing NumPy arrays for learning, validation and testing input and target sets -Misc. Functions: -- **create_static_test_set(data)**: taking an initialize_data object, this function will create separate files for the test and learning/validation data; useful for when you need a static test set for completely blind model testing -- **output_results(results, data, filename)**: outputs your prediction results from your model to a specified filename - - arguments are a list of results obtained from model.py, a data object from data_utils.py, and the filename to save to -- **denormalize_result(results, param_filepath)**: denormalizes a result, using min-max normalization paramters found in the param_filepath; returns denormalized results list +Functions: +- **output_results(results, DataFrame, filename)**: outputs *results* (calculated by model.py for a specified data set) to *filename*; *DataFrame* is required for outputting data entry names, strings, groups, etc. 
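A minimal sketch of driving *DataFrame* directly, bypassing the Server class (the database filename is a placeholder):

```python
from ecnet.data_utils import DataFrame

# Import a formatted database and build random learning/validation/testing sets
df = DataFrame('my_data.csv')
df.create_sets(random = True, split = [0.7, 0.2, 0.1])

# Package the sets as NumPy arrays ready to hand off to a model
packaged = df.package_sets()
print(packaged.learn_x.shape, packaged.learn_y.shape)
```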
## error_utils.py Notation: @@ -59,22 +46,5 @@ Error Functions: ## limit_parameters.py Functions: -- **limit(num_params, server)**: limits the number of input parameters to an integer value specified by num_params, using a "retain the best" process, where the best performing input parameter (based on RMSE) is retained, paired with every other input parameter until a best pair is found, repeated until the limit number has been reached - - returns a list of parameters -- **output(data, param_list, filename)**: saves a new .csv formatted database, using a generated parameter list and an output filename - -## abc.py -#### Class: ABC -Attributes: -- **valueRanges**: a list of tuples of value types to value range (value_type, (value_min, value_max)) -- **fitnessFunction**: fitness function to evaluate a set of values; must take one parameter, a list of values -- **endValue**: target fitness score which will terminate the program when reached -- **iterationAmount**: amount of iterations before terminating program -- **amountOfEmployers**: amount of sets of values stored per iteration - -Methods: -- **assignNewPositions(firstBee)**: assign a new position to a given bee -- **getFitnessAverage()**: collect the average of all the fitness scores across all employer bees -- **checkNewPosition(bee)**: Check if the new position is better than the fitness average, if it is, assign it to the bee -- **checkIfDone()**: Check if the best fitness score is lower than the target score to terminate the program; only valid if the argument endValue was assigned a value -- **runABC()**: run the artificial bee colony based on the arguments passed to the constructor. Must pass a fitness function and either a target fitness score or target iteration number in order to specify when the program will terminate. Must also specify value types/ranges. +- **limit_iterative_include(DataFrame, limit_num)**: limits the input dimensionality of data found in *DataFrame* to a dimensionality of *limit_num* using a "retain the best" algorithm +- **limit_genetic(DataFrame, limit_num, population_size, num_survivors, num_generations, print_feedback)**: limits the input dimensionality of data found in *DataFrame* to a dimensionality of *limit_num* using a genetic algorithm; *population_size* indicates the number of members for each generation, *num_survivors* indicates how many members of each generation survive, *num_generations* indicates how many generations the genetic algorithm runs for, and *print_feedback* is a boolean for the genetic algorithm to periodically print status updates diff --git a/ecnet/__init__.py b/ecnet/__init__.py index 93968af..376e6ae 100644 --- a/ecnet/__init__.py +++ b/ecnet/__init__.py @@ -3,4 +3,4 @@ import ecnet.error_utils import ecnet.model import ecnet.limit_parameters -import ecnet.abc +__version__ = '1.4.0' diff --git a/ecnet/abc.py b/ecnet/abc.py deleted file mode 100644 index 11bf23f..0000000 --- a/ecnet/abc.py +++ /dev/null @@ -1,178 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -# -# ecnet/abc.py -# v.1.3.0.dev1 -# Developed in 2018 by Hernan Gelaf-Romer -# -# This program implements an artificial bee colony to tune ecnet hyperparameters -# - -# 3rd party packages (open src.) 
-from random import randint -import numpy as np -import sys as sys - -### Artificial bee colony object, which contains multiple bee objects ### -class ABC: - - def __init__(self, valueRanges, fitnessFunction=None, endValue = None, iterationAmount = None, amountOfEmployers = 50): - if endValue == None and iterationAmount == None: - raise ValueError("must select either an iterationAmount or and endValue") - if fitnessFunction == None: - raise ValueError("must pass a fitness function") - print("***INITIALIZING***") - self.valueRanges = valueRanges - self.fitnessFunction = fitnessFunction - self.employers = [] - self.bestValues = [] # Store the values that are currently performing the best - self.onlooker = Bee('onlooker') - self.bestFitnessScore = None # Store the current best Fitness Score - self.fitnessAverage = 0 - self.endValue = endValue - self.iterationAmount = iterationAmount - # Initialize employer bees, assign them values/fitness scores - for i in range(amountOfEmployers): - sys.stdout.flush() - sys.stdout.write("Creating bee number: %d \r" % (i + 1)) - self.employers.append(Bee('employer', generateRandomValues(self.valueRanges))) - self.employers[i].currFitnessScore = self.fitnessFunction(self.employers[i].values) - print("***DONE INITIALIZING***") - - ### Assign a new position to the given bee - def assignNewPositions(self, firstBee): - valueTypes = [t[0] for t in self.valueRanges] - secondBee = randint(0, len(self.employers) -1) - # Avoid both bees being the same - while (secondBee == firstBee): - secondBee = randint(0, len(self.employers) -1) - self.onlooker.getPosition(self.employers, firstBee, secondBee, self.fitnessFunction, valueTypes) - - ### Collect the average fitness score across all employers - def getFitnessAverage(self): - self.fitnessAverage = 0 - for employer in self.employers: - self.fitnessAverage += employer.currFitnessScore - # While iterating through employers, look for the best fitness score/value pairing - if self.bestFitnessScore == None or employer.currFitnessScore < self.bestFitnessScore: - self.bestFitnessScore = employer.currFitnessScore - self.bestValues = employer.values - self.fitnessAverage /= len(self.employers) - - ### Check if new position is better than current position held by a bee - def checkNewPositions(self, bee): - # Update the bee's fitness/value pair if the new location is better - if bee.currFitnessScore > self.fitnessAverage: - bee.values = generateRandomValues(self.valueRanges) - bee.currFitnessScore = self.fitnessFunction(bee.values) - - ### If termination depends on a target value, check to see if it has been reached - def checkIfDone(self, count): - keepGoing = True - if self.endValue != None: - for employer in self.employers: - if employer.currFitnessScore <= self.endValue: - print("Fitness score =", employer.currFitnessScore) - print("Values =", employer.values) - keepGoing = False - elif count >= self.iterationAmount: - keepGoing = False - return keepGoing - - ### Run the artificial bee colony - def runABC(self): - running = True - count = 0 - - while True: - print("Assigning new positions") - for i in range(len(self.employers)): - sys.stdout.flush() - sys.stdout.write('At bee number: %d \r' % (i+1)) - self.assignNewPositions(i) - print("Getting fitness average") - self.getFitnessAverage() - print("Checking if done") - count+=1 - running = self.checkIfDone(count) - if running == False and self.endValue != None: - saveScore(self.bestFitnessScore, self.bestValues) - break - print("Current fitness average:", self.fitnessAverage) - 
print("Checking new positions, assigning random positions to bad ones") - for employer in self.employers: - self.checkNewPositions(employer) - print("Best score:", self.bestFitnessScore) - print("Best value:", self.bestValues) - if self.iterationAmount != None: - print("Iteration {} / {}".format(count, self.iterationAmount)) - if running == False: - saveScore(self.bestFitnessScore, self.bestValues) - break - saveScore(self.bestFitnessScore, self.bestValues) - - return self.bestValues - - -### Bee object, employers contain value/fitness -class Bee: - - def __init__(self, beeType, values=[]): - self.beeType = beeType - # Only the employer bees should store values/fitness scores - if beeType == "employer": - self.values = values - self.currFitnessScore = None - - ### Onlooker bee function, create a new set of positions - def getPosition(self, beeList, firstBee, secondBee, fitnessFunction, valueTypes): - newValues = [] - currValue = 0 - for i in range(len(valueTypes)): - currValue = valueFunction(beeList[firstBee].values[i], beeList[secondBee].values[i]) - if valueTypes[i] == 'int': - currValue = int(currValue) - newValues.append(currValue) - beeList[firstBee].getFitnessScore(newValues, fitnessFunction) - - #### Employer bee function, get fitness score for a given set of values - def getFitnessScore(self, values, fitnessFunction): - if self.beeType != "employer": - raise RuntimeError("Cannot get fitness score on a non-employer bee") - else: - # Your fitness function must take a certain set of values that you would like to optimize - fitnessScore = fitnessFunction(values) - if self.currFitnessScore == None or fitnessScore < self.currFitnessScore: - self.value = values - self.currFitnessScore = fitnessScore - -### Private functions to be called by ABC - -### Generate a random set of values given a value range -def generateRandomValues(value_ranges): - values = [] - if value_ranges == None: - raise RuntimeError("must set the type/range of possible values") - else: - # t[0] contains the type of the value, t[1] contains a tuple (min_value, max_value) - for t in value_ranges: - if t[0] == 'int': - values.append(randint(t[1][0], t[1][1])) - elif t[0] == 'float': - values.append(np.random.uniform(t[1][0], t[1][1])) - else: - raise RuntimeError("value type must be either an 'int' or a 'float'") - return values - -### Method of generating a value in between the values given -def valueFunction(a, b): - activationNum = np.random.uniform(-1, 1) - return a + abs(activationNum * (a - b)) - -### Function for saving the scores of each iteration onto a file -def saveScore(score, values, filename = 'scores.txt'): - f = open(filename, 'a') - string = "Score: {} Values: {}".format(score, values) - f.write(string) - f.write('\n') - f.close() diff --git a/ecnet/data_utils.py b/ecnet/data_utils.py index 380c12d..8465b1a 100644 --- a/ecnet/data_utils.py +++ b/ecnet/data_utils.py @@ -2,537 +2,382 @@ # -*- coding: utf-8 -*- # # ecnet/data_utils.py -# v.1.3.0.dev1 -# Developed in 2018 by Travis Kessler +# v.1.4.0 +# Developed in 2018 by Travis Kessler # -# This program contains the data object class, and functions for manipulating/importing/outputting data +# This program contains the "DataFrame" class, and functions for processing/importing/outputting +# data. High-level usage is handled by the "Server" class in server.py. For low-level +# usage explanations, refer to https://github.com/tjkessler/ecnet # +# 3rd party packages (open src.) 
import csv -import random -import pickle import numpy as np -from math import sqrt -import math -import sys -import copy -import warnings - -# Creates a static test set, as well as a static learning/validation set with remaining data -def create_static_test_set(data): - filename = data.file.split(".")[0] - # Header setup - control_row_1 = ["NUM OF MASTER"] - for i in range(0,len(data.controls_param_cols)): - control_row_1.append(data.controls_param_cols[i]) - control_row_2 = [data.controls_m_param_count] - for i in range(0,len(data.control_params)): - control_row_2.append(data.control_params[i]) - row_3 = ["DATAID", "T/V/L/U"] - row_4 = ["DATAid", "T/V/L"] - if data.controls_num_str != 0: - row_3.append("STRINGS") - for i in range(0,data.controls_num_str - 1): - row_3.append(" ") - for i in range(0,len(data.string_cols)): - row_4.append(data.string_cols[i]) - if data.controls_num_grp != 0: - row_3.append("GROUPS") - for i in range(0,data.controls_num_grp - 1): - row_3.append(" ") - for i in range(0,len(data.group_cols)): - row_4.append(data.group_cols[i]) - row_3.append("PARAMETERS") - for i in range(0,len(data.param_cols)): - row_4.append(data.param_cols[i]) - # Test set file - test_rows = [] - test_rows.append(control_row_1) - test_rows.append(control_row_2) - test_rows.append(row_3) - test_rows.append(row_4) - for i in range(0,len(data.test_dataid)): - local_row = [data.test_dataid[i], "T"] - for j in range(0,len(data.test_strings[i])): - local_row.append(data.test_strings[i][j]) - for j in range(0,len(data.test_groups[i])): - local_row.append(data.test_groups[i][j]) - for j in range(0,len(data.test_params[i])): - local_row.append(data.test_params[i][j]) - test_rows.append(local_row) - with open(filename + "_st.csv", 'w') as output_file: - wr = csv.writer(output_file, quoting = csv.QUOTE_ALL, lineterminator = '\n') - for row in range(0,len(test_rows)): - wr.writerow(test_rows[row]) - # Learning/validation set file - lv_rows = [] - lv_rows.append(control_row_1) - lv_rows.append(control_row_2) - lv_rows.append(row_3) - lv_rows.append(row_4) - for i in range(0,len(data.learn_dataid)): - local_row = [data.learn_dataid[i], "L"] - for j in range(0,len(data.learn_strings[i])): - local_row.append(data.learn_strings[i][j]) - for j in range(0,len(data.learn_groups[i])): - local_row.append(data.learn_groups[i][j]) - for j in range(0,len(data.learn_params[i])): - local_row.append(data.learn_params[i][j]) - lv_rows.append(local_row) - for i in range(0,len(data.valid_dataid)): - local_row = [data.valid_dataid[i], "V"] - for j in range(0,len(data.valid_strings[i])): - local_row.append(data.valid_strings[i][j]) - for j in range(0,len(data.valid_groups[i])): - local_row.append(data.valid_groups[i][j]) - for j in range(0,len(data.valid_params[i])): - local_row.append(data.valid_params[i][j]) - lv_rows.append(local_row) - with open(filename + "_slv.csv", 'w') as output_file: - wr = csv.writer(output_file, quoting = csv.QUOTE_ALL, lineterminator = '\n') - for row in range(0,len(lv_rows)): - wr.writerow(lv_rows[row]) - -# Saves test results, data strings, groups to desired output .csv file -def output_results(results, data, filename, dset): - # Makes sure filetype is csv - if ".csv" not in filename: - filename = filename + ".csv" - # List of all rows - rows = [] - # Title row, containing column names - title_row = [] - title_row.append("DataID") - for string in range(0,len(data.string_cols)): - title_row.append(data.string_cols[string]) - for group in range(0,len(data.group_cols)): - 
title_row.append(data.group_cols[group]) - title_row.append("DB Value") - for i in range(data.controls_num_outputs - 1): - title_row.append('') - for i in range(0,len(results)): - title_row.append("Predicted Value %d" %(i+1)) - if data.controls_num_outputs > 1: - for output in range(data.controls_num_outputs - 1): - title_row.append("") - rows.append(title_row) +import random as rm - # Determines which data set the results are from - which_data = dset - - # Format training data outputs - if which_data is 'train': - train_dataid = [] - train_strings = [] - train_groups = [] - train_y = [] - for i in range(len(data.learn_dataid)): - train_dataid.append(data.learn_dataid[i]) - for i in range(len(data.valid_dataid)): - train_dataid.append(data.valid_dataid[i]) - for i in range(len(data.learn_strings)): - train_strings.append(data.learn_strings[i]) - for i in range(len(data.valid_strings)): - train_strings.append(data.valid_strings[i]) - for i in range(len(data.learn_groups)): - train_groups.append(data.learn_groups[i]) - for i in range(len(data.valid_groups)): - train_groups.append(data.valid_groups[i]) - for i in range(len(data.learn_y)): - train_y.append(data.learn_y[i]) - for i in range(len(data.valid_y)): - train_y.append(data.valid_y[i]) - - # Adds data ID's, strings, groups, DB values and predictions for each test result to the rows list - for result in range(0,len(results[0])): - local_row = [] - # Export learning data results - if which_data is 'learn': - local_row.append(data.learn_dataid[result]) - for string in range(0,len(data.learn_strings[result])): - local_row.append(data.learn_strings[result][string]) - for group in range(0,len(data.learn_groups[result])): - local_row.append(data.learn_groups[result][group]) - for i in range(len(data.learn_y[result])): - local_row.append(data.learn_y[result][i]) - # Export validation data results - elif which_data is 'valid': - local_row.append(data.valid_dataid[result]) - for string in range(0,len(data.valid_strings[result])): - local_row.append(data.valid_strings[result][string]) - for group in range(0,len(data.valid_groups[result])): - local_row.append(data.valid_groups[result][group]) - for i in range(len(data.valid_y[result])): - local_row.append(data.valid_y[result][i]) - # Export testing data results - elif which_data is 'test': - local_row.append(data.test_dataid[result]) - for string in range(0,len(data.test_strings[result])): - local_row.append(data.test_strings[result][string]) - for group in range(0,len(data.test_groups[result])): - local_row.append(data.test_groups[result][group]) - for i in range(len(data.test_y[result])): - local_row.append(data.test_y[result][i]) - # Export training data results - elif which_data is 'train': - local_row.append(train_dataid[result]) - for string in range(0,len(train_strings[result])): - local_row.append(train_strings[result][string]) - for group in range(0,len(train_groups[result])): - local_row.append(train_groups[result][group]) - for i in range(len(train_y[result])): - local_row.append(train_y[result][i]) - # Export all data results - elif which_data is None: - local_row.append(data.dataid[result]) - for string in range(0,len(data.strings[result])): - local_row.append(data.strings[result][string]) - for group in range(0,len(data.groups[result])): - local_row.append(data.groups[result][group]) - for i in range(len(data.y[result])): - local_row.append(data.y[result][i]) - # Append predicted values - for i in range(len(results)): - for j in range(len(results[i][result])): - 
local_row.append(results[i][result][j]) - rows.append(local_row) - # Outputs each row to the output file - with open(filename, 'w') as output_file: - wr = csv.writer(output_file, quoting = csv.QUOTE_ALL, lineterminator = '\n') - for row in range(0,len(rows)): - wr.writerow(rows[row]) - -# Denormalizes resultant data using parameter file -def denormalize_result(results, param_filepath): - norm_file = open(param_filepath + ".ecnet","rb") - normalParams = pickle.load(norm_file) - norm_file.close() - dn_res = copy.copy(results) - try: - for i in range(0,len(dn_res[0])): - for j in range(0,len(dn_res)): - dn_res[j][i] = (dn_res[j][i]*normalParams[i][1])-normalParams[i][0] - return(dn_res) - except: - return [] - -### Initial definition of data object -class initialize_data: - def __init__(self, data_filename): - self.file = data_filename - - # Opening excel (csv) file, and parsing initial data - def build(self): - if(".xlsx" in self.file): - print(".xlsx file format detected. Please reformat as '.csv'.") - sys.exit() - elif(".csv" in self.file): - with open(self.file, newline='') as csvfile: - fileRaw = csv.reader(csvfile) - fileRaw = list(fileRaw) - # generates a raw list/2D Array for rows + cols of csv file; - # i.e. cell A1 = [0][0], A2 = [1][0], B2 = [1][1], etc. - else: - print("Error: Unsupported file format") - sys.exit() - - # parse master parameters from .csv file - self.controls_m_param_count = int(fileRaw[1][0]) # num of master parameters, defined by A2 - self.controls_param_cols = fileRaw[0][1:1+self.controls_m_param_count] # ROW 1 - self.control_params = fileRaw[1][1:1+self.controls_m_param_count] # ROW 2 - self.controls_num_str = int(self.control_params[0]) - self.controls_num_grp = int(self.control_params[1]) - self.controls_num_outputs = int(self.control_params[4]) - - # parse column names from .csv file - self.string_cols = fileRaw[3][2:2+self.controls_num_str] - self.group_cols = fileRaw[3][2+self.controls_num_str:2+self.controls_num_str+self.controls_num_grp] - self.param_cols = fileRaw[3][2+self.controls_num_str+self.controls_num_grp:-1] - (self.param_cols).append(fileRaw[3][-1]) - - # parse data from .csv file - self.dataid = [sublist[0] for sublist in fileRaw] - del self.dataid[0:4] #removal of title rows - self.strings = [sublist[2:2+self.controls_num_str] for sublist in fileRaw] - del self.strings[0:4] #removal of title rows - self.groups = [sublist[2+self.controls_num_str:2+self.controls_num_str+self.controls_num_grp] for sublist in fileRaw] - del self.groups[0:4] #removal of title rows - self.params = [sublist[2+self.controls_num_str+self.controls_num_grp:-1] for sublist in fileRaw] - del self.params[0:4] #removal of title rows - params_last = [sublist[-1] for sublist in fileRaw] - del(params_last[0:4]) - for i in range(0,len(self.params)): - self.params[i].append(params_last[i]) - - # parse T/V/L data - self.tvl_strings = [sublist[1] for sublist in fileRaw] - del self.tvl_strings[0:4] #removal of title rows - - # Drop any data from data set defined in 'Data to AUTOMATICALLY DROP' or Unreliable in csv file - dropListIndex = self.control_params[self.controls_param_cols.index("Data to AUTOMATICALLY DROP")] +''' +DataFrame class: Handles importing data from formatted CSV database, determining learning, +validation and testing sets, and packages sets as Numpy arrays for hand-off to models +''' +class DataFrame: + + ''' + Private DataPoint class: Contains all information for each data entry found in CSV database + ''' + class __DataPoint: + + def __init__(self): + + self.id 
= None + self.assignment = None + self.strings = [] + self.groups = [] + self.targets = [] + self.inputs = [] + + ''' + Initializes object, creates *DataPoint*s for each data entry + ''' + def __init__(self, filename): + + # Make sure filename is CSV + if not '.csv' in filename: + filename = filename + '.csv' + # Open the database file try: - drop_remaining = dropListIndex.split( ) - while len(drop_remaining) != 0: - dropRowNum = self.dataid.index(drop_remaining[0]) - del self.dataid[dropRowNum] - del self.strings[dropRowNum] - del self.groups[dropRowNum] - del self.params[dropRowNum] - del self.tvl_strings[dropRowNum] - del drop_remaining[0] - except: - pass - self.unreliable = [] - for i in range(0,len(self.dataid)): - if (self.tvl_strings[i]).startswith("U"): # deleting predetermined unreliable data - (self.unreliable).append(i) - for i in range(0,len(self.unreliable)): - del self.dataid[self.unreliable[-1]] - del self.strings[self.unreliable[-1]] - del self.groups[self.unreliable[-1]] - del self.params[self.unreliable[-1]] - del self.tvl_strings[self.unreliable[-1]] - del self.unreliable[-1] - # End of building - - # Normalizing the parameter data to be within range [0,1] for each parameter. For use with sigmoidal activation functions only. - def normalize(self, param_filepath = 'normalParams'): - minMaxList = [] - for i in range(0,len(self.params[0])): - beforeNormal = [sublist[i] for sublist in self.params] - beforeNormal = np.matrix(beforeNormal).astype(np.float).reshape(-1,1) - minVal = beforeNormal.min() - minAdjust = 0 - minVal # IMPORTANT VARIABLE for de-normalizing predicted data into final predicted format - for a in range(0,len(beforeNormal)): - beforeNormal[a] = beforeNormal[a] + minAdjust - maxVal = beforeNormal.max() # IMPORTANT VARIABLE for de-normalizing predicted data into final predicted format - minMaxList.append([minAdjust,maxVal]) - for b in range(0,len(beforeNormal)): - if maxVal != 0: - beforeNormal[b] = beforeNormal[b]/maxVal - else: - beforeNormal[b] = 0 - if i is 0: - normalized_list = beforeNormal - maxVal = maxVal - else: - normalized_list = np.column_stack([normalized_list,beforeNormal]) - normalized_list = normalized_list.tolist() - self.params = normalized_list - normal_file = open(param_filepath + ".ecnet", "wb") - pickle.dump(minMaxList,normal_file) # Saves the parameter list for opening in data normalization and predicting - normal_file.close() - - # Applying normalizing parameters using parameter file to new (unseen) data. Based on previously build network. 
- def applyNormal(self, param_filepath = 'normalParams'): - normal_file = open(param_filepath + ".ecnet", "rb") - self.normalParams = pickle.load(normal_file) - normal_file.close() - normalized_params = [] - warnings.filterwarnings("ignore") - for i in range(0,len(self.params)): - paramBeforeNorm = [] - for j in range(0,len(self.params[i])): - paramBeforeNorm.append(float(self.params[i][j])) - normParamsAdd = [] - for j in range(0,len(paramBeforeNorm)): - if math.isnan((paramBeforeNorm[j] + self.normalParams[j][0]) / self.normalParams[j][1]): - normParamsAdd.append(0) - else: - normParamsAdd.append((paramBeforeNorm[j] + self.normalParams[j][0]) / self.normalParams[j][1]) - normalized_params.append(normParamsAdd) - warnings.filterwarnings("default") - for i in range(0,len(self.params)): - self.params[i] = normalized_params[i] - - # Defining data for test, validation or learning based on either imported file or randomization - def buildTVL(self, sort_type = 'random', data_split = [0.65, 0.25, 0.1]): - self.testIndex = [] - self.validIndex = [] - self.learnIndex = [] - if 'explicit' in sort_type: - for i in range(0,len(self.dataid)): - if (self.tvl_strings[i]).startswith("T"): - (self.testIndex).append(i) - if (self.tvl_strings[i]).startswith("V"): - (self.validIndex).append(i) - if (self.tvl_strings[i]).startswith("L"): - (self.learnIndex).append(i) - elif 'random' in sort_type: - randIndex = random.sample(range(len(self.dataid)),len(self.dataid)) - self.randomizeData(randIndex, data_split) + with open(filename, newline = '') as file: + data_raw = csv.reader(file) + data_raw = list(data_raw) + # Database not found! + except FileNotFoundError: + raise Exception('ERROR: Supplied file not found in working directory') + + # Append each database data point to DataFrame's data_point list + self.data_points = [] + for point in range(2, len([sublist[0] for sublist in data_raw])): + # Define data point + new_point = self.__DataPoint() + # Set data point's id + new_point.id = [sublist[0] for sublist in data_raw][point] + # Set data point's assignment + new_point.assignment = [sublist[1] for sublist in data_raw][point] + for header in range(len(data_raw[0])): + # Append data point strings + if 'STRING' in data_raw[0][header]: + new_point.strings.append(data_raw[point][header]) + # Append data point groups + elif 'GROUP' in data_raw[0][header]: + new_point.groups.append(data_raw[point][header]) + # Append data point target values + elif 'TARGET' in data_raw[0][header]: + new_point.targets.append(data_raw[point][header]) + # Append data point input values + elif 'INPUT' in data_raw[0][header]: + new_point.inputs.append(data_raw[point][header]) + + # Append to data_point list + self.data_points.append(new_point) + + # Obtain string, group, target, input header names + self.string_names = [] + self.group_names = [] + self.target_names = [] + self.input_names = [] + for header in range(len(data_raw[0])): + if 'STRING' in data_raw[0][header]: + self.string_names.append(data_raw[1][header]) + if 'GROUP' in data_raw[0][header]: + self.group_names.append(data_raw[1][header]) + if 'TARGET' in data_raw[0][header]: + self.target_names.append(data_raw[1][header]) + if 'INPUT' in data_raw[0][header]: + self.input_names.append(data_raw[1][header]) + + # Helper variables for determining number of strings, groups, targets, inputs + self.num_strings = len(self.string_names) + self.num_groups = len(self.group_names) + self.num_targets = len(self.target_names) + self.num_inputs = len(self.input_names) + + ''' + DataFrame 
class length = number of data_points + ''' + def __len__(self): + + return len(self.data_points) + + ''' + Creates learning, validation and test sets + random: *True* for random assignments, *False* for explicit (point defined) assignments + split: if random == *True*, split[0] = learning, split[1] = validation, split[2] = test + (proportions) + ''' + def create_sets(self, random = True, split = [0.7, 0.2, 0.1]): + + # Define sets + self.learn_set = [] + self.valid_set = [] + self.test_set = [] + + # If using random assignments + if random: + # Create random indices for sets + rand_index = rm.sample(range(len(self)), len(self)) + learn_index = rand_index[0 : int(len(rand_index) * split[0])] + valid_index = rand_index[int(len(rand_index) * split[0]) : int(len(rand_index) * (split[0] + split[1]))] + test_index = rand_index[int(len(rand_index) * (split[0] + split[1])) : -1] + test_index.append(rand_index[-1]) + + # Append data points to sets, set assignment to RAND (random assignment) + for idx in learn_index: + self.data_points[idx].assignment = 'L' + self.learn_set.append(self.data_points[idx]) + for idx in valid_index: + self.data_points[idx].assignment = 'V' + self.valid_set.append(self.data_points[idx]) + for idx in test_index: + self.data_points[idx].assignment = 'T' + self.test_set.append(self.data_points[idx]) + + # Using explicit (point defined) assignments else: - print("Error: unknown sort_type method, no splitting done.") - sys.exit() - - # Randomizes T/V/L lists based on splitting percentage variables - def randomizeData(self, randIndex, data_split): - if data_split[2] != 0: - if data_split[2] != 1: - for i in range(0,round(data_split[2]*len(randIndex))): - (self.testIndex).append(randIndex[i]) - del randIndex[i] - else: - for i in range(0,len(randIndex)): - (self.testIndex).append(randIndex[i]) - randIndex = [] - (self.testIndex).sort() - if data_split[1] != 0: - if data_split[1] != 1: - for i in range(0,round(data_split[1]*len(randIndex))): - (self.validIndex).append(randIndex[i]) - del randIndex[i] + # Append data points to sets based on explicit assignments + for point in self.data_points: + if point.assignment == 'L': + self.learn_set.append(point) + elif point.assignment == 'V': + self.valid_set.append(point) + elif point.assignment == 'T': + self.test_set.append(point) + + ''' + Creates learning, validation and test sets containing specified proportions + (*split*) of each *sort_string* element (*sort_string* can be any STRING value + found in your database file) + ''' + def create_sorted_sets(self, sort_string, split = [0.7, 0.2, 0.1]): + + # Obtain index of *sort_string* from DataFrame's string names + string_idx = self.string_names.index(sort_string) + # Sort DataPoints by specified string + self.data_points.sort(key = lambda x: x.strings[string_idx]) + + # List containing all possible values from *sort_string* string + string_vals = [] + # Groups for each distinct string value, containing DataPoints + string_groups = [] + + # Find all string values in *sort_string*, add/create string_val and string_group entries + for point in self.data_points: + if point.strings[string_idx] not in string_vals: + string_vals.append(point.strings[string_idx]) + string_groups.append([point]) else: - for i in range(0,len(randIndex)): - (self.validIndex).append(randIndex[i]) - randIndex = [] - (self.validIndex).sort() - self.learnIndex = randIndex - (self.learnIndex).sort() - - # Shuffles specified sets - def shuffle(self, *args, data_split): - if ('l' or 'learn') and ('v' or 'validate') 
and ('t' or 'test') in args: - self.buildTVL('random', data_split) - self.applyTVL() - self.package() - elif ('l' or 'learn') and ('v' or 'validate') in args: - lv_dataid = [] - lv_params = [] - lv_strings = [] - lv_groups = [] - for i in range(0,len(self.learn_dataid)): - lv_dataid.append(self.learn_dataid[i]) - lv_params.append(self.learn_params[i]) - lv_strings.append(self.learn_strings[i]) - lv_groups.append(self.learn_groups[i]) - for i in range(0,len(self.valid_dataid)): - lv_dataid.append(self.valid_dataid[i]) - lv_params.append(self.valid_params[i]) - lv_strings.append(self.valid_strings[i]) - lv_groups.append(self.valid_groups[i]) - randIndex = random.sample(range(len(lv_dataid)),len(lv_dataid)) - - new_learn_index = [] - for i in range(len(self.learn_dataid)): - new_learn_index.append(randIndex[i]) - - new_valid_index = [] - for i in range(len(self.valid_dataid)): - new_valid_index.append(randIndex[-(i+1)]) - - new_learn_index.sort() - new_valid_index.sort() - self.valid_dataid = [] - self.valid_params = [] - self.valid_strings = [] - self.valid_groups = [] - self.learn_dataid = [] - self.learn_params = [] - self.learn_strings = [] - self.learn_groups = [] - for i in range(len(new_learn_index)): - (self.learn_dataid).append(self.dataid[new_learn_index[i]]) - (self.learn_params).append(self.params[new_learn_index[i]]) - (self.learn_strings).append(self.strings[new_learn_index[i]]) - (self.learn_groups).append(self.groups[new_learn_index[i]]) - for i in range(len(new_valid_index)): - (self.valid_dataid).append(self.dataid[new_valid_index[i]]) - (self.valid_params).append(self.params[new_valid_index[i]]) - (self.valid_strings).append(self.strings[new_valid_index[i]]) - (self.valid_groups).append(self.groups[new_valid_index[i]]) - self.package() - else: - print('Error: set shuffling arguments must be all sets ("learn", "validate", "test"), or training sets ("learn", "validate")') - - # Application of index values to data - def applyTVL(self): - self.test_dataid = [] - self.test_params = [] - self.test_strings = [] - self.test_groups = [] - self.valid_dataid = [] - self.valid_params = [] - self.valid_strings = [] - self.valid_groups = [] - self.learn_dataid = [] - self.learn_params = [] - self.learn_strings = [] - self.learn_groups = [] - for i in range(0,len(self.testIndex)): - (self.test_dataid).append(self.dataid[self.testIndex[i]]) - (self.test_params).append(self.params[self.testIndex[i]]) - (self.test_strings).append(self.strings[self.testIndex[i]]) - (self.test_groups).append(self.groups[self.testIndex[i]]) - for i in range(0,len(self.validIndex)): - (self.valid_dataid).append(self.dataid[self.validIndex[i]]) - (self.valid_params).append(self.params[self.validIndex[i]]) - (self.valid_strings).append(self.strings[self.validIndex[i]]) - (self.valid_groups).append(self.groups[self.validIndex[i]]) - for i in range(0,len(self.learnIndex)): - (self.learn_dataid).append(self.dataid[self.learnIndex[i]]) - (self.learn_params).append(self.params[self.learnIndex[i]]) - (self.learn_strings).append(self.strings[self.learnIndex[i]]) - (self.learn_groups).append(self.groups[self.learnIndex[i]]) - - # Builds x & y matrices (output for regression) - # Applies to whole data set, plus TVL lists - def package(self): - - # Whole data set - self.y = [sublist[0:self.controls_num_outputs] for sublist in self.params] - self.x = [sublist[self.controls_num_outputs:-1] for sublist in self.params] - x_last = [sublist[-1] for sublist in self.params] - for i in range(0,len(self.x)): - 
self.x[i].append(x_last[i]) - self.x = (np.asarray(self.x)).astype(np.float32) - if self.controls_num_outputs > 1: - for i in range(0,len(self.y)): - for j in range(0,len(self.y[i])): - self.y[i][j] = float(self.y[i][j]) - else: - self.y = (np.asarray(self.y)).astype(np.float32) - - # Test data set - self.test_y = [sublist[0:self.controls_num_outputs] for sublist in self.test_params] - self.test_x = [sublist[self.controls_num_outputs:-1] for sublist in self.test_params] - test_x_last = [sublist[-1] for sublist in self.test_params] - for i in range(0,len(self.test_x)): - self.test_x[i].append(test_x_last[i]) - self.test_x = (np.asarray(self.test_x)).astype(np.float32) - if self.controls_num_outputs > 1: - for i in range(0,len(self.test_y)): - for j in range(0,len(self.test_y[i])): - self.test_y[i][j] = float(self.test_y[i][j]) - else: - self.test_y = (np.asarray(self.test_y)).astype(np.float32) - - # Validation data set - self.valid_y = [sublist[0:self.controls_num_outputs] for sublist in self.valid_params] - self.valid_x = [sublist[self.controls_num_outputs:-1] for sublist in self.valid_params] - valid_x_last = [sublist[-1] for sublist in self.valid_params] - for i in range(0,len(self.valid_x)): - self.valid_x[i].append(valid_x_last[i]) - self.valid_x = (np.asarray(self.valid_x)).astype(np.float32) - if self.controls_num_outputs > 1: - for i in range(0,len(self.valid_y)): - for j in range(0,len(self.valid_y[i])): - self.valid_y[i][j] = float(self.valid_y[i][j]) - else: - self.valid_y = (np.asarray(self.valid_y)).astype(np.float32) - - # Learning data set - self.learn_y = [sublist[0:self.controls_num_outputs] for sublist in self.learn_params] - self.learn_x = [sublist[self.controls_num_outputs:-1] for sublist in self.learn_params] - learn_x_last = [sublist[-1] for sublist in self.learn_params] - for i in range(0,len(self.learn_x)): - self.learn_x[i].append(learn_x_last[i]) - self.learn_x = (np.asarray(self.learn_x)).astype(np.float32) - if self.controls_num_outputs > 1: - for i in range(0,len(self.learn_y)): - for j in range(0,len(self.learn_y[i])): - self.learn_y[i][j] = float(self.learn_y[i][j]) + string_groups[-1].append(point) + + # Reset lists for new set splits + self.learn_set = [] + self.valid_set = [] + self.test_set = [] + + # For each distinct string value from *sort_string*: + for group in string_groups: + # Assign learning data + learn_stop = int(split[0] * len(group)) + for point in group[0 : learn_stop]: + point.assignment = 'L' + self.learn_set.append(point) + # Assign validation data + valid_stop = learn_stop + int(split[1] * len(group)) + for point in group[learn_stop : valid_stop]: + point.assignment = 'V' + self.valid_set.append(point) + # Assign testing data + for point in group[valid_stop :]: + point.assignment = 'T' + self.test_set.append(point) + + ''' + Shuffles (new random assignments) the specified sets in *args*; (learning, validation, testing) + or (learning, validation) + ''' + def shuffle(self, *args, split = [0.7, 0.2, 0.1]): + + # Shuffle all sets (can just call create_sets again) + if 'l' and 'v' and 't' in args: + self.create_sets(split = split) + + # Shuffle training data (learning and validation sets) + elif 'l' and 'v' in args: + # Compile all training data into one list + lv_set = [] + for point in self.learn_set: + lv_set.append(point) + for point in self.valid_set: + lv_set.append(point) + # Generate random indices for new learning and validation sets + rand_index = rm.sample(range(len(self.learn_set) + len(self.valid_set)), (len(self.learn_set) + 
len(self.valid_set))) + learn_index = rand_index[0 : int(len(rand_index) * (split[0] / (1 - split[2]))) + 1] + valid_index = rand_index[int(len(rand_index) * (split[0] / (1 - split[2]))) + 1 : -1] + valid_index.append(rand_index[-1]) + + # Clear current learning and validation sets + self.learn_set = [] + self.valid_set = [] + + # Apply new indices to compiled training data, creating learning and validation sets + for idx in learn_index: + self.learn_set.append(lv_set[idx]) + for idx in valid_index: + self.valid_set.append(lv_set[idx]) else: - self.learn_y = (np.asarray(self.learn_y)).astype(np.float32) + raise Exception('ERROR: Shuffle arguments must be *l, v, t* or *l, v*') + + ''' + Private object containing lists (converted to numpy arrays by package_sets) with target + (y) and input (x) values (filled by package_sets) + ''' + class __PackagedData: + + def __init__(self): + + self.learn_x = [] + self.learn_y = [] + self.valid_x = [] + self.valid_y = [] + self.test_x = [] + self.test_y = [] + + ''' + Creates and returns *PackagedData* object containing numpy arrays with target (y) and + input (x) values for learning, validation and testing sets + ''' + def package_sets(self): + + # Create PackagedData object to return + pd = self.__PackagedData() + # Append learning inputs, learning targets to PackagedData object + for point in self.learn_set: + pd.learn_x.append(np.asarray(point.inputs).astype('float32')) + pd.learn_y.append(np.asarray(point.targets).astype('float32')) + # Append validation inputs, validation targets to PackagedData object + for point in self.valid_set: + pd.valid_x.append(np.asarray(point.inputs).astype('float32')) + pd.valid_y.append(np.asarray(point.targets).astype('float32')) + # Append testing inputs, testing targets to PackagedData object + for point in self.test_set: + pd.test_x.append(np.asarray(point.inputs).astype('float32')) + pd.test_y.append(np.asarray(point.targets).astype('float32')) + # Lists -> Numpy arrays + pd.learn_x = np.asarray(pd.learn_x) + pd.learn_y = np.asarray(pd.learn_y) + pd.valid_x = np.asarray(pd.valid_x) + pd.valid_y = np.asarray(pd.valid_y) + pd.test_x = np.asarray(pd.test_x) + pd.test_y = np.asarray(pd.test_y) + # Return packaged data + return pd + +''' +Outputs *results* to *filename*; *DataFrame*, a 'DataFrame' object, is required for +header formatting (strings, groups) and outputting individual point data +(id, assignment, strings, groups) +''' +def output_results(results, DataFrame, filename): + + # Ensure save filepath is CSV + if '.csv' not in filename: + filename += '.csv' + + # List of rows to be saved to CSV file + rows = [] + + # FIRST ROW: type headers + type_row = [] + type_row.append('DATAID') + type_row.append('ASSIGNMENT') + for string in range(DataFrame.num_strings): + type_row.append('STRING') + for group in range(DataFrame.num_groups): + type_row.append('GROUP') + for target in range(DataFrame.num_targets): + type_row.append('TARGET') + for result in results: + type_row.append('RESULT') + rows.append(type_row) + + # SECOND ROW: titles (including string, group, target names) + title_row = [] + title_row.append('DATAID') + title_row.append('ASSIGNMENT') + for string in DataFrame.string_names: + title_row.append(string) + for group in DataFrame.group_names: + title_row.append(group) + for target in DataFrame.target_names: + title_row.append(target) + for result in range(len(results)): + title_row.append(result) + rows.append(title_row) + + # Check which data set the results are from + if len(results[0]) == 
len(DataFrame.learn_set): + dset = 'learn' + elif len(results[0]) == len(DataFrame.valid_set): + dset = 'valid' + elif len(results[0]) == len(DataFrame.test_set): + dset = 'test' + elif len(results[0]) == (len(DataFrame.learn_set) + len(DataFrame.valid_set)): + dset = 'train' + else: + dset = None + + # If results are for training data, compile learning and validation data + if dset == 'train': + output_points = [] + for point in DataFrame.learn_set: + output_points.append(point) + for point in DataFrame.valid_set: + output_points.append(point) + # If results are for learning data, compile learning data + elif dset == 'learn': + output_points = DataFrame.learn_set + # If results are for validation data, compile validation data + elif dset == 'valid': + output_points = DataFrame.valid_set + # If results are for testing data, compile testing data + elif dset == 'test': + output_points = DataFrame.test_set + # Else, assume results are for all data, compile learning, validation and testing data + else: + output_points = [] + for point in DataFrame.learn_set: + output_points.append(point) + for point in DataFrame.valid_set: + output_points.append(point) + for point in DataFrame.test_set: + output_points.append(point) + + # Create rows for each data point in the compiled results + for idx, point in enumerate(output_points): + data_row = [] + data_row.append(point.id) + data_row.append(point.assignment) + for string in point.strings: + data_row.append(string) + for group in point.groups: + data_row.append(group) + for target in point.targets: + data_row.append(target) + for result in results: + if DataFrame.num_targets == 1: + data_row.append(result[idx][0]) + else: + data_row.append(result[idx]) + rows.append(data_row) - + # Save rows to CSV file specified in *filename* + with open(filename, 'w') as file: + wr = csv.writer(file, quoting = csv.QUOTE_ALL, lineterminator = '\n') + for row in rows: + wr.writerow(row) \ No newline at end of file diff --git a/ecnet/limit_parameters.py b/ecnet/limit_parameters.py index f640df4..86edffc 100644 --- a/ecnet/limit_parameters.py +++ b/ecnet/limit_parameters.py @@ -1,193 +1,306 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- # -# ecnet_limit_parameters.py +# ecnet/limit_parameters.py +# v.1.4.0 +# Developed in 2018 by Travis Kessler # -# Developed in 2017 by Travis Kessler -# -# This program contains the functions necessary for reducing the input dimensionality of a database to the most influential input parameters +# This program contains the functions necessary for reducing the input dimensionality of a +# database to the most influential input parameters, using either an iterative inclusion +# algorithm (limit_iterative_include) or a genetic algorithm (limit_genetic). # +# 3rd party packages (open src.) 
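A minimal sketch of how these two entry points might be invoked on their own, assuming *df* is a populated ecnet.data_utils.DataFrame; 'my_database.csv' and the numeric arguments are placeholders (the population, survivor and generation values mirror the defaults used by Server.limit_parameters):

```python
from ecnet.data_utils import DataFrame
from ecnet.limit_parameters import limit_iterative_include, limit_genetic

# Load a database and split it into learning/validation/testing sets
df = DataFrame('my_database.csv')
df.create_sets(random = True, split = [0.65, 0.25, 0.10])

# Iterative inclusion: greedily retain the best-performing input parameters
params = limit_iterative_include(df, limit_num = 15)

# Or: genetic algorithm search over combinations of input parameters
params = limit_genetic(df, limit_num = 15, population_size = 500,
                       num_survivors = 200, num_generations = 25)
```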
import csv import copy +from pygenetics.ga_core import Population -def limit(server, param_num): - # - data = copy.copy(server.data) - - # Grabs paramter names - param_names = data.param_cols[:] - param_list = [] - - # Exclude output parameters from algorithm - for output in range(0,data.controls_num_outputs): - del param_names[output] - - # Initial definition of total lists used for limiting - learn_params = [] - valid_params = [] - total_params = [] - - # Until the param_list is populated with specified number of params: - while len(param_list) < param_num: - - # Used for determining which paramter at the current iteration performs the best - param_RMSE_list = [] - - # Grabs each parameter one at a time - for parameter in range(0,len(data.x[0])): - - # Each parameter is a sublist of the total lists - learn_params_add = [sublist[parameter] for sublist in data.learn_x] - valid_params_add = [sublist[parameter] for sublist in data.valid_x] - total_params_add = [sublist[parameter] for sublist in data.x] - - # Formatting - for i in range(0,len(learn_params_add)): - learn_params_add[i] = [learn_params_add[i]] - for i in range(0,len(valid_params_add)): - valid_params_add[i] = [valid_params_add[i]] - for i in range(0,len(total_params_add)): - total_params_add[i] = [total_params_add[i]] - - # If looking for the first parameter, each parameter is tested individually - if len(learn_params) is 0: - learn_input = learn_params_add[:] - valid_input = valid_params_add[:] - total_input = total_params_add[:] - - # Else, new parameter in question is appended to the current parameter list +# ECNet source files +import ecnet.model +import ecnet.error_utils + +''' +Limits the dimensionality of input data found in supplied *DataFrame* object to a +dimensionality of *limit_num* +''' +def limit_iterative_include(DataFrame, limit_num): + + # List of retained input parameters + retained_input_list = [] + + # Initialization of retained input lists + learn_input_retained = [] + valid_input_retained = [] + test_input_retained = [] + + # Until specified number of paramters *limit_num* are retained + while len(retained_input_list) < limit_num: + + # List of RMSE's for currently retained inputs + new inputs to test + retained_rmse_list = [] + + # For all input paramters to test + for idx, param in enumerate(DataFrame.input_names): + + # Shuffle all sets + DataFrame.shuffle('l', 'v', 't') + # Obtain Numpy arrays for learning, validation, testing sets + packaged_data = DataFrame.package_sets() + + # Obtain input parameter column for learning, validation and test sets + learn_input_add = [[sublist[idx]] for sublist in packaged_data.learn_x] + valid_input_add = [[sublist[idx]] for sublist in packaged_data.valid_x] + test_input_add = [[sublist[idx]] for sublist in packaged_data.test_x] + + # No retained input parameters, inputs = individual parameters to be tested + if len(retained_input_list) is 0: + learn_input = learn_input_add + valid_input = valid_input_add + test_input = test_input_add + # Else: else: - learn_input = [] - valid_input = [] - total_input = [] - - # Adds the current paramter lists to the inputs - for i in range(0,len(learn_params)): - learn_input.append(learn_params[i][:]) - for i in range(0,len(valid_params)): - valid_input.append(valid_params[i][:]) - for i in range(0,len(total_params)): - total_input.append(total_params[i][:]) - - # Adds the new paramter in question - for i in range(0,len(learn_params_add)): - learn_input[i].append(learn_params_add[i][0]) - for i in range(0,len(valid_params_add)): - 
valid_input[i].append(valid_params_add[i][0]) - for i in range(0,len(total_params_add)): - total_input[i].append(total_params_add[i][0]) - - # Re-imports data for training - #server.import_data() - - # Assigns the configured data to the server data object - server.data.x = total_input[:] - server.data.y = data.y[:] - server.data.learn_x = learn_input[:] - server.data.learn_y = data.learn_y[:] - server.data.valid_x = valid_input[:] - server.data.valid_y = data.valid_y[:] - - # Trains the model - server.fit_mlp_model_validation() - - # Determines the RMSE of the model with the current inputs, adds it to total list - local_rmse = server.calc_error('rmse')['rmse'] - param_RMSE_list.append(local_rmse) - - # Determines lowest RMSE of the current iteration, which corresponds to the best performing parameter - val, idx = min((val, idx) for (idx, val) in enumerate(param_RMSE_list)) - - # Packages the best performing parameter - add_to_learn = [sublist[idx] for sublist in data.learn_x] - add_to_valid = [sublist[idx] for sublist in data.valid_x] - add_to_total = [sublist[idx] for sublist in data.x] - - # Adds the best performing parameter to the total lists ***Conditional used for formatting discrepancies - if len(param_list) is 0: - for i in range(0,len(add_to_learn)): - learn_params.append([add_to_learn[i]]) - for i in range(0,len(add_to_valid)): - valid_params.append([add_to_valid[i]]) - for i in range(0,len(add_to_total)): - total_params.append([add_to_total[i]]) + # Inputs = currently retained input parameters + learn_input = copy.deepcopy(learn_input_retained) + valid_input = copy.deepcopy(valid_input_retained) + test_input = copy.deepcopy(test_input_retained) + # Add new input parameter to inputs + for idx_add, param_add in enumerate(learn_input_add): + learn_input[idx_add].append(param_add[0]) + for idx_add, param_add in enumerate(valid_input_add): + valid_input[idx_add].append(param_add[0]) + for idx_add, param_add in enumerate(test_input_add): + test_input[idx_add].append(param_add[0]) + + # Create neural network model + mlp_model = ecnet.model.MultilayerPerceptron() + mlp_model.add_layer(len(learn_input[0]), 'relu') + mlp_model.add_layer(5, 'relu') + mlp_model.add_layer(5, 'relu') + mlp_model.add_layer(len(packaged_data.learn_y[0]), 'linear') + mlp_model.connect_layers() + + # Fit the model using validation + mlp_model.fit_validation( + learn_input, + packaged_data.learn_y, + valid_input, + packaged_data.valid_y, + max_epochs = 5000) + + # Calculate error for test set results, append to rmse list + retained_rmse_list.append(ecnet.error_utils.calc_rmse( + mlp_model.use(test_input), + packaged_data.test_y)) + + # Obtain index, value of best performing input paramter addition + rmse_val, rmse_idx = min((rmse_val, rmse_idx) for (rmse_idx, rmse_val) in enumerate(retained_rmse_list)) + + # Obtain input parameter addition with lowest error + learn_retain_add = [[sublist[rmse_idx]] for sublist in packaged_data.learn_x] + valid_retain_add = [[sublist[rmse_idx]] for sublist in packaged_data.valid_x] + test_retain_add = [[sublist[rmse_idx]] for sublist in packaged_data.test_x] + + # No retained input parameters, retained = lowest error input parameter + if len(retained_input_list) is 0: + learn_input_retained = learn_retain_add + valid_input_retained = valid_retain_add + test_input_retained = test_retain_add + # Else: else: - for i in range(0,len(add_to_learn)): - learn_params[i].append(add_to_learn[i]) - for i in range(0,len(add_to_valid)): - valid_params[i].append(add_to_valid[i]) - for i in 
range(0,len(add_to_total)): - total_params[i].append(add_to_total[i]) - - # Adds the best performing parameter to the parameter list - param_list.append(param_names[idx]) - - # Prints the parameter list after each iteration, as well as the RMSE - if server.vars['project_print_feedback'] == True: - print(param_list) - print(val) + # Append lowest error input parameter to retained parameters + for idx, param in enumerate(learn_retain_add): + learn_input_retained[idx].append(param[0]) + for idx, param in enumerate(valid_retain_add): + valid_input_retained[idx].append(param[0]) + for idx, param in enumerate(test_retain_add): + test_input_retained[idx].append(param[0]) + + # Append name of retained input parameter to retained list + retained_input_list.append(DataFrame.input_names[rmse_idx]) + # List currently retained input parameters + print(retained_input_list) + print(rmse_val) + print() + + # Compiled *limit_num* input parameters, return list of retained parameters + return retained_input_list + +''' +Limits the dimensionality of input data found in supplied *DataFrame* object to a +dimensionality of *limit_num* using a genetic algorithm. Optional arguments for +*population_size* of genetic algorithm's population, *num_survivors* for selecting +the best performers from each population generation to reproduce, *num_generations* +for the number of times the population will reproduce, and *print_feedback* for +printing the average fitness score of the population after each generation. +''' +def limit_genetic(DataFrame, limit_num, population_size, num_survivors, num_generations, print_feedback = True): + + ''' + Genetic algorithm cost function, supplied to the genetic algorithm; returns the RMSE + of the test set results from a model constructed using the current permutation of + input parameters *feed_dict* supplied by the genetic algorithm + ''' + def ecnet_limit_inputs(feed_dict): + + # Set up learning, validation and testing sets + learn_input = [] + valid_input = [] + test_input = [] + + # For the input parameters chosen by the genetic algorithm: + for idx, param in enumerate(feed_dict): + + # Grab the input parameter + learn_input_add = [[sublist[feed_dict[param]]] for sublist in packaged_data.learn_x] + valid_input_add = [[sublist[feed_dict[param]]] for sublist in packaged_data.valid_x] + test_input_add = [[sublist[feed_dict[param]]] for sublist in packaged_data.test_x] + + # Currently empty sets, sets = add lists + if len(learn_input) == 0: + learn_input = learn_input_add + valid_input = valid_input_add + test_input = test_input_add + # Append add lists to sets + else: + for idx_add, param_add in enumerate(learn_input_add): + learn_input[idx_add].append(param_add[0]) + for idx_add, param_add in enumerate(valid_input_add): + valid_input[idx_add].append(param_add[0]) + for idx_add, param_add in enumerate(test_input_add): + test_input[idx_add].append(param_add[0]) - # Returns the parameter list - return param_list - -def output(data, param_list, filename): - # Checks for .csv file format - if ".csv" not in filename: - filename = filename + ".csv" - # Creates list of spreadsheet rows + # Construct a neural network (multilayer perceptron) model + mlp_model = ecnet.model.MultilayerPerceptron() + mlp_model.add_layer(len(learn_input[0]), 'relu') + mlp_model.add_layer(8, 'relu') + mlp_model.add_layer(8, 'relu') + mlp_model.add_layer(len(packaged_data.learn_y[0]), 'linear') + mlp_model.connect_layers() + + # Train the model using validation + mlp_model.fit_validation( + learn_input, + 
packaged_data.learn_y, + valid_input, + packaged_data.valid_y, + max_epochs = 5000) + + # Returned fitness value = test set performance + return ecnet.error_utils.calc_rmse(mlp_model.use(test_input), packaged_data.test_y) + + ''' + Genetic algorithm selection function, supplied to the genetic algorithm; returns + the *n* best performing *members* from the genetic algorithm's population + ''' + def minimize_best_n(members, n): + return(sorted(members, key = lambda member: member.fitness_score)[0:n]) + + # Package data for training/testing + packaged_data = DataFrame.package_sets() + + # Initialize genetic algorithm population + population = Population(size = population_size, cost_fn = ecnet_limit_inputs, select_fn = minimize_best_n) + + # Create genetic algorithm parameters for each input parameter *n* + for i in range(limit_num): + population.add_parameter(i, 0, DataFrame.num_inputs - 1) + + # Generate the genetic algorithm's initial population + population.generate_population() + + # Print average population fitness (if printing feedback) + if print_feedback: + print('Generation: 0 - Population fitness: ' + str(sum(p.fitness_score for p in population.members) / len(population))) + + # Run the genetic algorithm for *num_generations* generations + for gen in range(num_generations): + population.next_generation(num_survivors = num_survivors, mut_rate = 0) + if print_feedback: + print('Generation: ' + str(gen + 1) + ' - Population fitness: ' + str(sum(p.fitness_score for p in population.members) / len(population))) + + # Find the best performing member from the final generation + min_idx = 0 + for new_idx, member in enumerate(population.members): + if member.fitness_score < population.members[min_idx].fitness_score: + min_idx = new_idx + + # Obtain parameter names from best performer + input_list = [] + for val in population.members[min_idx].feed_dict.values(): + input_list.append(DataFrame.input_names[val]) + + # Print best performer (if printing feedback) + if print_feedback: + print('Best member fitness score: ' + str(population.members[min_idx].fitness_score)) + print(input_list) + + # Return the limited list + return input_list + +''' +Saves the parameters *param_list* (obtained from limit) to new database specified +by *filename*. A *DataFrame* object is required for new database formatting and +populating. 
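For example, output(DataFrame, param_list, 'limited_database.csv') writes a new database
containing only the retained input columns ('limited_database.csv' is a placeholder name,
and *param_list* is the list returned by limit_iterative_include or limit_genetic).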
+''' +def output(DataFrame, param_list, filename): + + # Check filename format + if '.csv' not in filename: + filename += '.csv' + + # List of rows to be saved to CSV file rows = [] - # Row 1: Main controls - control_row_1 = ["NUM OF MASTER"] - for i in range(0,len(data.controls_param_cols)): - control_row_1.append(data.controls_param_cols[i]) - rows.append(control_row_1) - # Row 2: Main control values - control_row_2 = [data.controls_m_param_count] - for i in range(0,len(data.control_params)): - control_row_2.append(data.control_params[i]) - rows.append(control_row_2) - # Rows 3 and 4: Column groups and sub-groups - row_3 = ["DATAID", "T/V/L/U"] - row_4 = ["DATAid", "T/V/L"] - if data.controls_num_str != 0: - row_3.append("STRINGS") - for i in range(0,data.controls_num_str - 1): - row_3.append(" ") - for i in range(0,len(data.string_cols)): - row_4.append(data.string_cols[i]) - if data.controls_num_grp != 0: - row_3.append("GROUPS") - for i in range(0,data.controls_num_grp - 1): - row_3.append(" ") - for i in range(0,len(data.group_cols)): - row_4.append(data.group_cols[i]) - row_3.append("PARAMETERS") - rows.append(row_3) - for i in range(0,data.controls_num_outputs): - row_4.append(data.param_cols[i]) - param_idx = [] - for i in range(0,len(param_list)): - row_4.append(param_list[i]) - for j in range(0,len(data.param_cols)): - if param_list[i] == data.param_cols[j]: - param_idx.append(j) - break - rows.append(row_4) - # Data value rows - for i in range(0,len(data.dataid)): - local_row = [data.dataid[i], data.tvl_strings[i]] - for j in range(0,len(data.strings[i])): - local_row.append(data.strings[i][j]) - for j in range(0,len(data.groups[i])): - local_row.append(data.groups[i][j]) - for j in range(0,data.controls_num_outputs): - local_row.append(data.params[i][j]) - for j in range(0,len(param_idx)): - local_row.append(data.params[i][param_idx[j]]) - rows.append(local_row) - # Output to file - with open(filename, 'w') as output_file: - wr = csv.writer(output_file, quoting = csv.QUOTE_ALL, lineterminator = '\n') - for row in range(0,len(rows)): - wr.writerow(rows[row]) + + # FIRST ROW: type headers + type_row = [] + type_row.append('DATAID') + type_row.append('ASSIGNMENT') + for string in DataFrame.string_names: + type_row.append('STRING') + for group in DataFrame.group_names: + type_row.append('GROUP') + for target in DataFrame.target_names: + type_row.append('TARGET') + for input_param in param_list: + type_row.append('INPUT') + rows.append(type_row) + + # SECOND ROW: titles (including string, group, target, input names) + title_row = [] + title_row.append('DATAID') + title_row.append('ASSIGNMENT') + for string in DataFrame.string_names: + title_row.append(string) + for group in DataFrame.group_names: + title_row.append(group) + for target in DataFrame.target_names: + title_row.append(target) + for input_param in param_list: + title_row.append(input_param) + rows.append(title_row) + + # Obtain new parameter name indices in un-limited database + input_param_indices = [] + for param in param_list: + input_param_indices.append(DataFrame.input_names.index(param)) + + # Create rows for each data point found in the DataFrame + for point in DataFrame.data_points: + data_row = [] + data_row.append(point.id) + data_row.append(point.assignment) + for string in point.strings: + data_row.append(string) + for group in point.groups: + data_row.append(group) + for target in point.targets: + data_row.append(target) + for param in input_param_indices: + data_row.append(point.inputs[param]) + 
rows.append(data_row) + + # Save all the rows to the new database file + with open(filename, 'w') as file: + wr = csv.writer(file, quoting = csv.QUOTE_ALL, lineterminator = '\n') + for row in rows: + wr.writerow(row) \ No newline at end of file diff --git a/ecnet/model.py b/ecnet/model.py index 98efda1..12b30d3 100644 --- a/ecnet/model.py +++ b/ecnet/model.py @@ -2,186 +2,239 @@ # -*- coding: utf-8 -*- # # ecnet/error_utils.py -# v.1.3.0.dev1 -# Developed in 2018 by Travis Kessler +# v.1.4.0 +# Developed in 2018 by Travis Kessler # -# This program contains functions necessary creating, training, saving, and importing neural network models +# This program contains functions necessary creating, training, saving, and reusing neural network models # +# 3rd party packages (open src.) import tensorflow as tf import numpy as np import pickle -import os from functools import reduce -class multilayer_perceptron: - # initialization of model structure +''' +Basic neural network (multilayer perceptron); contains methods for adding layers +with specified neuron counts and activation functions, training the model, using +the model on new data, saving the model for later use, and reusing a previously +trained model +''' +class MultilayerPerceptron: + + ''' + Initialization of object + ''' def __init__(self): + self.layers = [] self.weights = [] self.biases = [] tf.reset_default_graph() - - # adds a skeleton for the layer: [number of neurons, activation function] - def addLayer(self, num_neurons, function = "relu"): - self.layers.append([num_neurons, function]) - - # connects skeleton layers: results in random weights and biases - def connectLayers(self): - # weights - for layer in range(0, len(self.layers)-1): - self.weights.append(tf.Variable(tf.random_normal([self.layers[layer][0], self.layers[layer+1][0]]), name = "W_fc%d"%(layer + 1))) - # biases - for layer in range(1, len(self.layers)): - self.biases.append(tf.Variable(tf.random_normal([self.layers[layer][0]]), name = "B_fc%d"%(layer))) - # function for feeding data through the model, and returns the output - def feed_forward(self, x): - layerOutput = [x] + ''' + Layer definition object, containing layer size and activation function to be used + ''' + class Layer: + + def __init__(self, size, act_fn): + + self.size = size + self.act_fn = act_fn + + ''' + Adds a layer definition to the model; default activation function is ReLU + ''' + def add_layer(self, size, act_fn = 'relu'): + + self.layers.append(self.Layer(size, act_fn)) + + ''' + Connects the layers in *self.layers* by creating weight matrices, bias vectors + ''' + def connect_layers(self): + + # Create weight matrices (size = layer_n by layer_n+1) + for layer in range(len(self.layers) - 1): + self.weights.append(tf.Variable(tf.random_normal([self.layers[layer].size, self.layers[layer + 1].size]), name = 'W_fc%d' % (layer + 1))) + # Create bias vectors (size = layer_n) for layer in range(1, len(self.layers)): - # relu - if "relu" in self.layers[layer][1]: - layerOutput.append(tf.nn.relu(tf.add(tf.matmul(layerOutput[-1], self.weights[layer - 1]), self.biases[layer - 1]))) - # sigmoid - elif "sigmoid" in self.layers[layer][1]: - layerOutput.append(tf.nn.sigmoid(tf.add(tf.matmul(layerOutput[-1], self.weights[layer - 1]), self.biases[layer - 1]))) - # linear - elif "linear" in self.layers[layer][1]: - layerOutput.append(tf.add(tf.matmul(layerOutput[-1], self.weights[layer - 1]), self.biases[layer - 1])) - elif "softmax" in self.layers[layer][1]: - 
layerOutput.append(tf.nn.softmax(tf.add(tf.matmul(layerOutput[-1], self.weights[layer - 1]), self.biases[layer - 1]))) - return(layerOutput[-1]) - - ### Data is served to the model, and fits the model to the data + self.biases.append(tf.Variable(tf.random_normal([self.layers[layer].size]), name = 'B_fc%d' % (layer))) + + ''' + Fits the neural network model using input data *x_l* and target data *y_l*. Optional arguments: + *learning_rate* (training speed of the model) and *train_epochs* (number of traning iterations). + ''' def fit(self, x_l, y_l, learning_rate = 0.1, train_epochs = 500): - # placeholder variables for input and output matrices - x = tf.placeholder("float", [None, self.layers[0][0]]) - y = tf.placeholder("float", [None, self.layers[-1][0]]) - pred = self.feed_forward(x) - - # cost function and optimizer - TODO: look into other optimizers besides Adam + + # TensorFlow placeholder variables for inputs and targets + x = tf.placeholder('float', [None, self.layers[0].size]) + y = tf.placeholder('float', [None, self.layers[-1].size]) + + # Predictions = *__feed_forward* final output + pred = self.__feed_forward(x) + + # Cost function = squared error between targets and predictions cost = tf.square(y - pred) + + # Optimizer = AdamOptimizer (TODO: Look into other optimizers) optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost) - - # opens the tensorflow session for training - with tf.Session() as self.sess: - # initialize the pre-defined variables - self.sess.run(tf.global_variables_initializer()) - # runs training loop for explicit number of epochs -> find in config.yaml + + # Run the TensorFlow session + with tf.Session() as sess: + # Initialize TensorFlow variables (weights, biases, placeholders) + sess.run(tf.global_variables_initializer()) + # Train for *train_epochs* iterations for epoch in range(train_epochs): - self.sess.run(optimizer, feed_dict = {x: x_l, y: y_l}) - # saves a temporary output file, variables (weights, biases) included + sess.run(optimizer, feed_dict = {x: x_l, y: y_l}) + # Saves model (weights and biases) to temporary output file saver = tf.train.Saver() - saver.save(self.sess,"./tmp/_ckpt") - self.sess.close() - - ### Data is served to the model, and fits the model to the data using periodic validation - def fit_validation(self, x_l, x_v, y_l, y_v, learning_rate = 0.1, mdrmse_stop = 0.1, mdrmse_memory = 50, max_epochs = 500): - # placeholder variables for input and output matrices - x = tf.placeholder("float", [None, self.layers[0][0]]) - y = tf.placeholder("float", [None, self.layers[-1][0]]) - pred = self.feed_forward(x) - - # variables and arrays for validation process - mdRMSE = 1 - current_epoch = 0 - rmse_list = [] - delta_list = [] - - # cost function and optimizer - TODO: look into other optimizers besides Adam + saver.save(sess, './tmp/_ckpt') + # Finish the TensorFlow session + sess.close() + + ''' + Fits the neural network model using input data *x_l* and target data *y_l*, validating + the learning process periodically based on validation data (*x_v* and *y_v*) performance). 
+ Optional arguments: *learning_rate* (training speed of the model), *max_epochs* (cutoff + point if training takes too long) + ''' + def fit_validation(self, x_l, y_l, x_v, y_v, learning_rate = 0.1, max_epochs = 2500): + + # TensorFlow placeholder variables for inputs and targets + x = tf.placeholder('float', [None, self.layers[0].size]) + y = tf.placeholder('float', [None, self.layers[-1].size]) + + # Predictions = *__feed_forward* final output + pred = self.__feed_forward(x) + + # Cost function = squared error between targets and predictions cost = tf.square(y - pred) + + # Optimizer = AdamOptimizer (TODO: Look into other optimizers) optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost) + + # Run the TensorFlow session + with tf.Session() as sess: + # Initialize TensorFlow variables (weights, biases, placeholders) + sess.run(tf.global_variables_initializer()) - # opens the tensorflow session for training - with tf.Session() as self.sess: - # initialize the pre-defined variables - self.sess.run(tf.global_variables_initializer()) - # while current mdRMSE is more than the cutoff point, and the max num of epochs hasn't been reached: - while mdRMSE > mdrmse_stop and current_epoch < max_epochs: - self.sess.run(optimizer, feed_dict = {x: x_l, y: y_l}) + # Lowest validation RMSE value (used as cutoff if validation RMSE rises above by 5%) + valid_rmse_lowest = 50 + current_epoch = 0 + + while current_epoch < max_epochs: + # Run training iteration + sess.run(optimizer, feed_dict = {x: x_l, y: y_l}) + # Current epoch ++ current_epoch += 1 - # determine new mdRMSE after every 100 epochs - if current_epoch % 100 == 0: - valid_pred = self.sess.run(pred, feed_dict = {x: x_v}) - rmse_list.append(calc_valid_rmse(valid_pred, y_v)) - if len(rmse_list) > 1: - delta_list.append(abs(rmse_list[-2] - rmse_list[-1])) - # mdRMSE memory: how far back the function looks to determine mdRMSE - if len(delta_list) > mdrmse_memory: - del(delta_list[0]) - mdRMSE = reduce(lambda x, y: x + y, delta_list) / len(delta_list) - + # Every 250 epochs (TODO: make this a variable?): + if current_epoch % 250 == 0: + # Calculate validation set RMSE + valid_rmse = self.__calc_rmse(sess.run(pred, feed_dict = {x: x_v}), y_v) + # If RMSE is less than the current lowest RMSE, make new lowest RMSE + if valid_rmse < valid_rmse_lowest: + valid_rmse_lowest = valid_rmse + # If RMSE is greater than the current lowest + 5%, done with training + elif valid_rmse > valid_rmse_lowest + (0.05 * valid_rmse_lowest): + break + + # Done training, save the model to temporary file saver = tf.train.Saver() - saver.save(self.sess, "./tmp/_ckpt") - self.sess.close() - - ### Tests the test data from the server - def test_new(self, x): - with tf.Session() as self.sess: + saver.save(sess, './tmp/_ckpt') + # Finish the TensorFlow session + sess.close() + + ''' + Use the neural network model on input data *x*, returns *result* + ''' + def use(self, x): + + # Run the TensorFlow session + with tf.Session() as sess: + # Import temporary output file containing weights and biases saver = tf.train.Saver() - saver.restore(self.sess, "./tmp/_ckpt") - result = self.feed_forward(x) - result = result.eval() - self.sess.close() + saver.restore(sess, './tmp/_ckpt') + # Evaluate result + result = self.__feed_forward(x).eval() + # Finish the TensorFlow session + sess.close() + # Return the result return result - - ### Saves the _ckpt.ecnet file to a pre-defined output file - def save_net(self, output_filepath): - with tf.Session() as self.sess: + + 
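A minimal end-to-end sketch of this class, assuming TensorFlow is installed and using toy numpy arrays in place of data packaged by DataFrame.package_sets(); the layer sizes, epoch limit and save path are placeholders, and the ./tmp directory is created up front because fit, fit_validation, use and save read or write the './tmp/_ckpt' checkpoint:

```python
import os
import numpy as np
from ecnet.model import MultilayerPerceptron

os.makedirs('./tmp', exist_ok = True)            # checkpoint directory used by the class

x_l = np.random.rand(50, 4).astype('float32')    # learning inputs (toy data)
y_l = np.random.rand(50, 1).astype('float32')    # learning targets
x_v = np.random.rand(10, 4).astype('float32')    # validation inputs
y_v = np.random.rand(10, 1).astype('float32')    # validation targets

model = MultilayerPerceptron()
model.add_layer(4, 'relu')        # input layer (size = number of inputs)
model.add_layer(5, 'relu')        # hidden layers
model.add_layer(5, 'relu')
model.add_layer(1, 'linear')      # output layer (size = number of targets)
model.connect_layers()

model.fit_validation(x_l, y_l, x_v, y_v, learning_rate = 0.1, max_epochs = 2500)
predictions = model.use(x_v)      # predictions for the validation inputs
model.save('my_model')            # also writes 'my_model.struct' with the architecture
```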
''' + Saves the neural network model (outside of temp file) to *filepath* for later use + ''' + def save(self, filepath): + + # Run the TensorFlow session + with tf.Session() as sess: + # Import temporary output file containing weights and biases saver = tf.train.Saver() - saver.restore(self.sess, "./tmp/_ckpt") - saver.save(self.sess, "./" + output_filepath + ".sess") - self.sess.close() - architecture_file = open("./" + output_filepath + ".struct", "wb") + saver.restore(sess, './tmp/_ckpt') + # Resave temporary file to file specified by *filepath* + saver.save(sess, './' + filepath) + # Finish the TensorFlow session + sess.close() + # Save the neural network model's architecture (layer sizes, activation functions) + architecture_file = open('./' + filepath + '.struct', 'wb') pickle.dump(self.layers, architecture_file) architecture_file.close() - - ### Loads a pre-defined file into the model - def load_net(self, model_load_filename): - architecture_file = open("./" + model_load_filename + ".struct", "rb") + + ''' + Loads the neural network model found at *filepath* + ''' + def load(self, filepath): + + # Import the architecture 'struct' file (layer sizes, activation functions) + architecture_file = open('./' + filepath + '.struct', 'rb') self.layers = pickle.load(architecture_file) architecture_file.close() - self.connectLayers() - with tf.Session() as self.sess: - saver = tf.train.Saver() - saver.restore(self.sess, "./" + model_load_filename + ".sess") - saver.save(self.sess, "./tmp/_ckpt") - self.sess.close() - - ### Return numerical values for weights - def export_weights(self): - weights = [] - with tf.Session() as self.sess: - saver = tf.train.Saver() - saver.restore(self.sess, "./tmp/_ckpt") - for i in range(0,len(self.weights)): - weights.append(self.weights[i].eval()) - return weights - - ### Return numerical values for biases - def export_biases(self): - biases = [] - with tf.Session() as self.sess: + # Redefine weights and biases + self.connect_layers() + # Run the TensorFlow session + with tf.Session() as sess: + # Import file containing weights and biases saver = tf.train.Saver() - saver.restore(self.sess, "./tmp/_ckpt") - for i in range(0,len(self.biases)): - biases.append(self.biases[i].eval()) - return biases - -def calc_valid_rmse(x, y): - try: - return(np.sqrt(((x-y)**2).mean())) - except: + saver.restore(sess, './' + filepath) + # Save weights and biases to temporary file for use by 'fit', 'use', etc. 
+ saver.save(sess, './tmp/_ckpt') + # Finish the TensorFlow session + sess.close() + + ''' + PRIVATE METHOD: Feeds data through the neural network, returns output of final layer + ''' + def __feed_forward(self, x): + + # First values to matrix multiply are the inputs + output = x + # For each layer (after the first layer, input) + for index, layer in enumerate(self.layers[1:]): + # ReLU activation function + if layer.act_fn == 'relu': + output = tf.nn.relu(tf.add(tf.matmul(output, self.weights[index]), self.biases[index])) + # Sigmoid activation function + elif layer.act_fn == 'sigmoid': + output = tf.nn.relu(tf.add(tf.matmul(output, self.weights[index]), self.biases[index])) + # Linear activation function + elif layer.act_fn == 'linear': + output = tf.add(tf.matmul(output, self.weights[index]), self.biases[index]) + # Softmax activation function + elif layer.act_fn == 'softmax': + output = tf.nn.softmax(tf.add(tf.matmul(output, self.weights[index]), self.biases[index])) + + # Return the final layer's output + return output + + ''' + PRIVATE METHOD: Calculates the RMSE of the validation set during training + ''' + def __calc_rmse(self, y_hat, y): try: - return(np.sqrt(((np.asarray(x)-np.asarray(y))**2).mean())) + return(np.sqrt(((y_hat-y)**2).mean())) except: - print("Error in calculating RMSE. Check input data format.") - sys.exit() - - - - - - - - + try: + return(np.sqrt(((np.asarray(y_hat)-np.asarray(y))**2).mean())) + except: + raise Exception('ERROR: Unable to calculate RMSE. Check input data format.') \ No newline at end of file diff --git a/ecnet/server.py b/ecnet/server.py index 40fe40f..e823a64 100644 --- a/ecnet/server.py +++ b/ecnet/server.py @@ -2,448 +2,570 @@ # -*- coding: utf-8 -*- # # ecnet/server.py -# v.1.3.0.dev1 -# Developed in 2018 by Travis Kessler +# v.1.4.0 +# Developed in 2018 by Travis Kessler # -# This program contains all the necessary config parameters and network serving functions +# This file contains the "Server" class, which handles ECNet project creation, +# neural network model creation, data hand-off to models, and project error +# calculation. For example scripts, refer to https://github.com/tjkessler/ecnet # # 3rd party packages (open src.) 
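The header above points to example scripts in the repository; as a quick orientation, a minimal workflow sketch follows, in which 'my_data.csv' and 'my_project' are placeholder names and every call mirrors a method defined below:

```python
from ecnet.server import Server

sv = Server()                                    # loads config.yml, or creates a default one
sv.import_data('my_data.csv')                    # import and split the database
sv.create_project('my_project')                  # optional; skip it to train a single model
sv.train_model('shuffle_lv', validate = True)    # train trials, shuffling learn/valid sets
sv.select_best(dset = 'test')                    # project only: keep the best trial per node
results = sv.use_model(dset = 'test')            # predictions for the test set
errors = sv.calc_error('rmse', 'r2', dset = 'test')
sv.output_results(results, 'my_results.csv')
sv.save_project()
```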
import yaml -import pickle -import numpy as np -import sys +import warnings import os +import numpy as np import zipfile -from shutil import copyfile +import pickle +from ecabc.abc import ABC -# ECNet program files +# ECNet source files import ecnet.data_utils +import ecnet.error_utils import ecnet.model import ecnet.limit_parameters -import ecnet.error_utils -import ecnet.abc -### Config/server object; to be referenced by most other files ### +''' +Server object: handles project creation/usage for ECNet projects, handles +data hand-off to neural networks for model training +''' class Server: - ### Initial declaration, handles config import - def __init__(self): + + ''' + Initialization: imports configuration variables from *filename* + ''' + def __init__(self, filename = 'config.yml'): + + # Dictionary containing configuration variables self.vars = {} - self.vars.update(import_config(filename = 'config.yml')) - self.folder_structs_built = False - ### Imports the data and stores it in the server + # Open configuration file found at *filename* + try: + file = open(filename, 'r') + self.vars.update(yaml.load(file)) + + # Configuration file not found, create default 'config.yml' + except: + warnings.warn('WARNING: supplied configuration file not found: creating default config.yml') + config_dict = { + 'data_filename' : 'data.csv', + 'data_sort_type' : 'random', + 'data_split' : [0.65,0.25,0.10], + 'learning_rate' : 0.1, + 'mlp_hidden_layers' : [[5, 'relu'], [5, 'relu']], + 'mlp_in_layer_activ' : 'relu', + 'mlp_out_layer_activ' : 'linear', + 'project_name' : 'my_project', + 'project_num_builds' : 1, + 'project_num_nodes' : 1, + 'project_num_trials' : 1, + 'project_print_feedback': True, + 'train_epochs' : 500, + 'valid_max_epochs': 5000 + } + filename = 'config.yml' + file = open(filename, 'w') + yaml.dump(config_dict, file) + self.vars.update(config_dict) + + # Set configuration filename + self.config_filename = filename + # Initial state of Server is to create single models + self.using_project = False + + + ''' + Creates the folder structure for a project; if not called, Server will create + just one neural network model. A project consists of builds, each build + containing nodes (node predictions are averaged for final build prediction). + Number of builds and nodes are specified in the configuration file. 
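    As an illustration, with 2 builds and 2 nodes per build the resulting layout is
    my_project/build_0/node_0 through my_project/build_1/node_1; train_model later saves
    'trial_%d' models, and select_best a 'final_net' model, inside each node folder.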
+ ''' + def create_project(self, project_name = None): + + # If alternate project name is not given, use configuration file's project name + if project_name is None: + project_name = self.vars['project_name'] + # If alternate name is given, update Server config with new name + else: + self.vars['project_name'] = project_name + # Create base project folder in working directory (if not already there) + if not os.path.exists(project_name): + os.makedirs(project_name) + # For each build (number of builds specified in config): + for build in range(self.vars['project_num_builds']): + # Create build path (project + build number) + path_b = os.path.join(project_name, 'build_%d' % build) + # Create folder (if it doesn't exist) + if not os.path.exists(path_b): + os.makedirs(path_b) + # For each node (number of nodes specified in config): + for node in range(self.vars['project_num_nodes']): + # Create node path (project + build number + node number) + path_n = os.path.join(path_b, 'node_%d' % node) + # Create folder (if it doesn't exits) + if not os.path.exists(path_n): + os.makedirs(path_n) + # Update Server boolean, indicating we are now using a project (instead of single model) + self.using_project = True + + ''' + Imports data from *data_filename*; utilizes 'data_utils' for file I/O, data set splitting + and data set packaging for hand-off to neural network models + ''' def import_data(self, data_filename = None): + + # If no filename specified, use configuration filename if data_filename is None: data_filename = self.vars['data_filename'] - self.data = ecnet.data_utils.initialize_data(data_filename) - self.data.build() - self.data.buildTVL(self.vars['data_sort_type'], self.vars['data_split']) - if self.vars['normals_use'] == True: - if not os.path.isdir('./tmp'): - os.makedirs('./tmp') - if not os.path.exists('./tmp/normal_params.ecnet'): - self.data.normalize('./tmp/normal_params') + # Else, set config variable to supplied filename + else: + self.vars['data_filename'] = data_filename + # Import the data using *data_utils* *DataFrame* object + self.DataFrame = ecnet.data_utils.DataFrame(data_filename) + # Create learning, validation and testing sets + if self.vars['data_sort_type'] == 'random': + self.DataFrame.create_sets(random = True, split = self.vars['data_split']) + else: + self.DataFrame.create_sets(random = False) + # Package sets for model hand-off + self.packaged_data = self.DataFrame.package_sets() + + def limit_parameters(self, limit_num, output_filename, use_genetic = False, population_size = 500, num_survivors = 200, num_generations = 25): + + if use_genetic: + params = ecnet.limit_parameters.limit_genetic(self.DataFrame, limit_num, population_size, num_survivors, num_generations, print_feedback = self.vars['project_print_feedback']) + else: + params = ecnet.limit_parameters.limit_iterative_include(self.DataFrame, limit_num) + ecnet.limit_parameters.output(self.DataFrame, params, output_filename) + + ''' + Tunes the neural network learning hyperparameters (learning_rate, valid_max_epochs, neuron + counts in each hidden layer) using an artificial bee colony algorithm (ecabc package) + ''' + def tune_hyperparameters(self, target_score = None, iteration_amt = 50, amt_employers = 50): + + # Make sure project is not constructed + if self.using_project: + warnings.warn('WARNING: tune_hyperparameters() uses individual neural networks, not projects. 
Setting using_project boolean to false.') + self.using_project = False + + ''' + Fitness function to be used by the artificial bee colony + ''' + def test_neural_network(values): + + self.vars['learning_rate'] = values[0] + self.vars['valid_max_epochs'] = values[1] + self.vars['mlp_hidden_layers'][0][0] = values[2] + self.vars['mlp_hidden_layers'][1][0] = values[3] + + self.train_model(validate = True) + return self.calc_error('rmse', dset = 'test')['rmse'] + + # Minimum and maximum values for hyperparameters (learning rate, valid_max_epochs, hidden layer neuron count) + hyperparameters = [('float', (0.01, 0.2)), ('int', (1000, 25000)), ('int', (8, 32)), ('int', (8, 32))] + + # If *target_score* (RMSE) is not given, run ABC for *iteration_amt* iterations + if target_score is None: + abc = ABC(iterationAmount = iteration_amt, + fitnessFunction = test_neural_network, + valueRanges = hyperparameters, + amountOfEmployers = amt_employers) + + # Else, run ABC until *target_score* is reached + else: + abc = ABC(endValue = target_score, + fitnessFunction = test_neural_network, + valueRanges = hyperparameters, + amountOfEmployers = amt_employers) + + # Run the artificial bee colony + abc.printInfo(self.vars['project_print_feedback']) + new_hyperparameters = abc.runABC() + + # Set Server hyperparameters to ABC-calculated hyperparameters + self.vars['learning_rate'] = new_hyperparameters[0] + self.vars['valid_max_epochs'] = new_hyperparameters[1] + self.vars['mlp_hidden_layers'][0][0] = new_hyperparameters[2] + self.vars['mlp_hidden_layers'][1][0] = new_hyperparameters[3] + + # Return ABC-calculated hyperparameters + return new_hyperparameters + + ''' + Trains a neural network (multilayer perceptron) using learning data, and validation data (if *validate* + == True). 
*args is used to specify shuffling of data sets for each trial; use "shuffle_lv" (shuffles + training data) or "shuffle_lvt" (shuffles all data) + ''' + def train_model(self, *args, validate = False): + + # Not using project, train single model + if not self.using_project: + # Create the model, train the model, save the model to temp folder + mlp_model = self.__create_mlp_model() + # Use validation sets to periodically test model's performance and determine when to stop; + # prevents overfitting + if validate: + mlp_model.fit_validation( + self.packaged_data.learn_x, + self.packaged_data.learn_y, + self.packaged_data.valid_x, + self.packaged_data.valid_y, + self.vars['learning_rate'], + self.vars['valid_max_epochs']) + # No validation is used, just train for 'training_epochs' iterations else: - self.data.applyNormal('./tmp/normal_params') - self.data.applyTVL() - self.data.package() - - ### Determines which 'param_num' parameters contribute to an accurate output; supply the number of parameters to limit to, and the output filename - def limit_parameters(self, param_num, limited_database_output_filename): - params = ecnet.limit_parameters.limit(self, param_num) - ecnet.limit_parameters.output(self.data, params, limited_database_output_filename) - - ### Creates the save environment - def create_save_env(self): - create_folder_structure(self) - - ### Fits the model(s) using predetermined number of learning epochs - def fit_mlp_model(self, *args): - ### PROJECT ### - if self.folder_structs_built == True: - for build in range(0,self.vars['project_num_builds']): - if self.vars['project_print_feedback'] == True: - print("Build %d of %d"%(build+1,self.vars['project_num_builds'])) - for node in range(0,self.vars['project_num_nodes']): - if self.vars['project_print_feedback'] == True: - print("Node %d of %d"%(node+1,self.vars['project_num_nodes'])) - for trial in range(0,self.vars['project_num_trials']): - if self.vars['project_print_feedback'] == True: - print("Trial %d of %d"%(trial+1,self.vars['project_num_trials'])) - self.output_filepath = os.path.join(os.path.join(os.path.join(self.vars['project_name'], self.build_dirs[build]), self.node_dirs[build][node]), "model_output" + "_%d"%(trial + 1)) - self.model = create_model(self) - self.model.fit(self.data.learn_x, self.data.learn_y, self.vars['learning_rate'], self.vars['train_epochs']) - res = self.model.test_new(self.data.x) - self.model.save_net(self.output_filepath) - if 'shuffle_lv' in args: - self.data.shuffle('l', 'v', data_split = self.vars['data_split']) - elif 'shuffle_lvt' in args: - self.data.shuffle('l', 'v', 't', data_split = self.vars['data_split']) - ### SINGLE NET ### + mlp_model.fit( + self.packaged_data.learn_x, + self.packaged_data.learn_y, + self.vars['learning_rate'], + self.vars['train_epochs']) + mlp_model.save('./tmp/model_output') + + # Project is constructed, create models according to configuration specifications else: - self.model = create_model(self) - self.model.fit(self.data.learn_x, self.data.learn_y, self.vars['learning_rate'], self.vars['train_epochs']) - self.model.save_net("./tmp/model_output") - - ### Fits the model(s) using validation RMSE cutoff method, or max epochs - def fit_mlp_model_validation(self, *args): - ### PROJECT ### - if self.folder_structs_built == True: - for build in range(0,self.vars['project_num_builds']): - if self.vars['project_print_feedback'] == True: - print("Build %d of %d"%(build+1,self.vars['project_num_builds'])) - for node in range(0,self.vars['project_num_nodes']): - if 
self.vars['project_print_feedback'] == True: - print("Node %d of %d"%(node+1,self.vars['project_num_nodes'])) - for trial in range(0,self.vars['project_num_trials']): - if self.vars['project_print_feedback'] == True: - print("Trial %d of %d"%(trial+1,self.vars['project_num_trials'])) - self.output_filepath = os.path.join(os.path.join(os.path.join(self.vars['project_name'], self.build_dirs[build]), self.node_dirs[build][node]), "model_output" + "_%d"%(trial + 1)) - self.model = create_model(self) - self.model.fit_validation(self.data.learn_x, self.data.valid_x, self.data.learn_y, self.data.valid_y, self.vars['learning_rate'], self.vars['valid_mdrmse_stop'], self.vars['valid_mdrmse_memory'], self.vars['valid_max_epochs']) - res = self.model.test_new(self.data.x) - self.model.save_net(self.output_filepath) + # For each build: + for build in range(self.vars['project_num_builds']): + # For each node: + for node in range(self.vars['project_num_nodes']): + # For each trial: + for trial in range(self.vars['project_num_trials']): + # Print status update (if config variable is True) + if self.vars['project_print_feedback']: + print('Build %d, Node %d, Trial %d...' % (build + 1, node + 1, trial + 1)) + # Determine filepath where trial will be saved + path_b = os.path.join(self.vars['project_name'], 'build_%d' % build) + path_n = os.path.join(path_b, 'node_%d' % node) + path_t = os.path.join(path_n, 'trial_%d' % trial) + # Create the model, train the model, save the model to trial filepath + mlp_model = self.__create_mlp_model() + # Use validation sets to periodically test model's performance and determine when done training + if validate: + mlp_model.fit_validation( + self.packaged_data.learn_x, + self.packaged_data.learn_y, + self.packaged_data.valid_x, + self.packaged_data.valid_y, + self.vars['learning_rate'], + self.vars['valid_max_epochs']) + # No validation is used, just train for 'training_epochs' iterations + else: + mlp_model.fit( + self.packaged_data.learn_x, + self.packaged_data.learn_y, + self.vars['learning_rate'], + self.vars['train_epochs']) + mlp_model.save(path_t) + # Shuffle the training data sets if 'shuffle_lv' in args: - self.data.shuffle('l', 'v', data_split = self.vars['data_split']) + self.DataFrame.shuffle('l', 'v', split = self.vars['data_split']) + self.packaged_data = self.DataFrame.package_sets() + # Shuffle all data sets elif 'shuffle_lvt' in args: - self.data.shuffle('l', 'v', 't', data_split = self.vars['data_split']) - ### SINGLE NET ### - else: - self.model = create_model(self) - self.model.fit_validation(self.data.learn_x, self.data.valid_x, self.data.learn_y, self.data.valid_y, self.vars['learning_rate'], self.vars['valid_mdrmse_stop'], self.vars['valid_mdrmse_memory'], self.vars['valid_max_epochs']) - self.model.save_net("./tmp/model_output") - - ### Selects the best performing networks from each node of each build. Folder structs must be created. - def select_best(self, dset = None): - ### SINGLE MODEL ### - if self.folder_structs_built == False: - print("Error: Project folder structure must be built in order to select best.") - sys.exit() - ### PROJECT ### + self.DataFrame.create_sets(split = self.vars['data_split']) + self.packaged_data = self.DataFrame.package_sets() + + ''' + Selects the best performing model from each node for each build to represent + the node (build prediction = average of node predictions). Selection of best + model is based on data set *dset* performance. This method may take a while, + depending on project size. 
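    For example, select_best(dset = 'test') keeps, for each node, the trial with the
    lowest RMSE on the test set, saving it as that node's 'final_net'.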
+ ''' + def select_best(self, dset = None, error_fn = 'rmse'): + + # If not using a project, no need to call this function! + if not self.using_project: + raise Exception('ERROR: Project is not created; project structure is required to select best models!') + # Using a project else: - for i in range(0,self.vars['project_num_builds']): - for j in range(0,self.vars['project_num_nodes']): + # Print status update (if config variable is True) + print('Selecting best models from each node for each build...') + # Determine input values and target values to use for selection (based on dset arg) + x_vals = self.__determine_x_vals(dset) + y_vals = self.__determine_y_vals(dset) + # For each build: + for build in range(self.vars['project_num_builds']): + # Determine build path + path_b = os.path.join(self.vars['project_name'], 'build_%d' % build) + # For each node: + for node in range(self.vars['project_num_nodes']): + # Determine node path + path_n = os.path.join(path_b, 'node_%d' % node) + # List of trial errors within the node rmse_list = [] - for k in range(0,self.vars['project_num_trials']): - self.model_load_filename = os.path.join(os.path.join(self.vars['project_name'], "build_%d"%(i+1)),os.path.join("node_%d"%(j+1), "model_output" + "_%d"%(k+1))) - self.model = ecnet.model.multilayer_perceptron() - self.model.load_net(self.model_load_filename) - x_vals = determine_x_vals(self, dset) - y_vals = determine_y_vals(self, dset) - res = self.model.test_new(x_vals) - rmse = ecnet.error_utils.calc_rmse(res, y_vals) - rmse_list.append(rmse) + # For each trial: + for trial in range(self.vars['project_num_trials']): + # Create model, load trial, calculate error, append to list + mlp_model = ecnet.model.MultilayerPerceptron() + mlp_model.load(os.path.join(path_n, 'trial_%d' % trial)) + rmse_list.append(self.__error_fn(error_fn, mlp_model.use(x_vals), y_vals)) + # Determines the lowest error in error list current_min = 0 - for error in range(0,len(rmse_list)): - if rmse_list[error] < rmse_list[current_min]: - current_min = error - self.model_load_filename = os.path.join(os.path.join(self.vars['project_name'], "build_%d"%(i+1)),os.path.join("node_%d"%(j+1), "model_output" + "_%d"%(current_min+1))) - self.output_filepath = os.path.join(os.path.join(self.vars['project_name'], "build_%d"%(i+1)),os.path.join("node_%d"%(j+1),"final_net_%d"%(j+1))) - self.resave_net(self.output_filepath) - - ### Predicts values for the current test set data - def use_mlp_model(self, dset = None): - x_vals = determine_x_vals(self, dset) - ### SINGLE MODEL ### - if self.folder_structs_built == False: - self.model = ecnet.model.multilayer_perceptron() - self.model.load_net("tmp/model_output") - if self.vars['normals_use'] == True: - res = ecnet.data_utils.denormalize_result(self.model.test_new(x_vals), './tmp/normal_params') - else: - res = self.model.test_new(x_vals) - return [res] - - ### PROJECT ### + for new_min in range(len(rmse_list)): + if rmse_list[new_min] < rmse_list[current_min]: + current_min = new_min + # Load the model with the lowest error, resave as 'final_net' in the node folder + mlp_model = ecnet.model.MultilayerPerceptron() + mlp_model.load(os.path.join(path_n, 'trial_%d' % current_min)) + mlp_model.save(os.path.join(path_n, 'final_net')) + + ''' + Use trained neural network (multilayer perceptron), either single or build, + to predict values for specified *dset* + ''' + def use_model(self, dset = None): + + # Determine data set to be passed to model, specified by *dset* + x_vals = self.__determine_x_vals(dset) + 
# Not using project, use single model + if not self.using_project: + # Create model object + mlp_model = ecnet.model.MultilayerPerceptron() + # Load the trained model + mlp_model.load('./tmp/model_output') + # Return results obtained from model + return [mlp_model.use(x_vals)] + # Project is constructed, use project builds to predict values else: - final_preds = [] - # For each build - for i in range(0,self.vars['project_num_builds']): - predlist = [] - # For each node - for j in range(0,self.vars['project_num_nodes']): - self.model_load_filename = os.path.join(os.path.join(self.vars['project_name'], "build_%d"%(i+1)),os.path.join("node_%d"%(j+1),"final_net_%d"%(j+1))) - self.model = ecnet.model.multilayer_perceptron() - self.model.load_net(self.model_load_filename) - if self.vars['normals_use'] == True: - res = ecnet.data_utils.denormalize_result(self.model.test_new(x_vals), './tmp/normal_params') - else: - res = self.model.test_new(x_vals) - predlist.append(res) - finalpred = [] - # Check for one, or multiple outputs - if self.data.controls_num_outputs is 1: - for j in range(0,len(predlist[0])): - local_raw = [] - for k in range(len(predlist)): - local_raw.append(predlist[k][j]) - finalpred.append([np.mean(local_raw)]) - final_preds.append(finalpred) - else: - for j in range(len(predlist[0])): - build_ave = [] - for k in range(len(predlist)): - build_ave.append(predlist[k][j]) - finalpred.append(sum(build_ave)/len(build_ave)) - final_preds.append(list(finalpred)) - return final_preds - - ### Calculates errors for each given argument + # List of final predictions + preds = [] + # For each project build: + for build in range(self.vars['project_num_builds']): + # Determine build path + path_b = os.path.join(self.vars['project_name'], 'build_%d' % build) + # Build predictions (one from each node) + build_preds = [] + # For each node: + for node in range(self.vars['project_num_nodes']): + # Determine node path + path_n = os.path.join(path_b, 'node_%d' % node) + # Determine final build (from select_best) path + path_f = os.path.join(path_n, 'final_net') + # Create model, load net, append results + mlp_model = ecnet.model.MultilayerPerceptron() + mlp_model.load(path_f) + build_preds.append(mlp_model.use(x_vals)) + # Average build prediction = average of node predictions + ave_build_preds = [] + # For each data point + for point in range(len(build_preds[0])): + # List of node predictions for individual data point + local_pred = [] + # For each node prediction + for pred in range(len(build_preds)): + # Append node prediction for data point + local_pred.append(build_preds[pred][point]) + # Compute average of node predictions for point, append to ave list + ave_build_preds.append(sum(local_pred) / len(local_pred)) + # Append average build prediction to list of final predictions + preds.append(list(ave_build_preds)) + # Return final predictions + return preds + + ''' + Calculates and returns errors based on input *args; possible arguments are *rmse*, + *r2* (r-squared correlation coefficient), *mean_abs_error*, *med_abs_error*. Multiple + error arguments can be supplied. *dset* argument specifies which data set the error + is being calculated for (e.g. 'test', 'train'). Returns dictionary of error values. 
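    For example, calc_error('rmse', 'r2', dset = 'test') returns a dictionary of the form
    {'rmse': ..., 'r2': ...}, where each value is a list of per-build errors when a project
    is in use, or a single value for a single model.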
+ ''' def calc_error(self, *args, dset = None): + # Dictionary keys = error arguments, values = error values error_dict = {} - preds = self.use_mlp_model(dset) - y_vals = determine_y_vals(self, dset) + # Obtain predictions for specified data set + y_hat = self.use_model(dset) + # Determine target values for specified data set + y = self.__determine_y_vals(dset) + # For each supplied error argument: for arg in args: - ### Single Model ### - if self.folder_structs_built == False: - if self.vars['normals_use'] == True: - rmse = error_fn(arg, preds, ecnet.data_utils.denormalize_result(y_vals, './tmp/normal_params')) - else: - rmse = error_fn(arg, preds, y_vals) - error_dict[arg] = rmse - ### PROJECT ### + # Using project + if self.using_project: + # List of error for each build + error_list = [] + # For each build's prediction: + for pred in y_hat: + # Append build error to error list + error_list.append(self.__error_fn(arg, pred, y)) + # Key error = error list + error_dict[arg] = error_list + # Using single model else: - rmse_list = [] - for i in range(0,len(preds)): - if self.vars['normals_use'] == True: - rmse_list.append(error_fn(arg, preds[i], ecnet.data_utils.denormalize_result(y_vals, './tmp/normal_params'))) - else: - rmse_list.append(error_fn(arg, preds[i], y_vals)) - error_dict[arg] = rmse_list + # Key error = calculated error + error_dict[arg] = self.__error_fn(arg, y_hat, y) + # Return error dictionary return error_dict - - ### Outputs results to desired .csv file - def output_results(self, results, filename, dset = None): - if self.vars['normals_use'] is True: - normalized_data = self.data.y[:] - normalized_learn = self.data.learn_y[:] - normalized_valid = self.data.valid_y[:] - normalized_test = self.data.test_y[:] - self.data.y = ecnet.data_utils.denormalize_result(self.data.y, './tmp/normal_params') - self.data.learn_y = ecnet.data_utils.denormalize_result(self.data.learn_y, './tmp/normal_params') - self.data.valid_y = ecnet.data_utils.denormalize_result(self.data.valid_y, './tmp/normal_params') - self.data.test_y = ecnet.data_utils.denormalize_result(self.data.test_y, './tmp/normal_params') - ecnet.data_utils.output_results(results, self.data, filename, dset) - if self.vars['normals_use'] is True: - self.data.y = normalized_data - self.data.learn_y = normalized_learn - self.data.valid_y = normalized_valid - self.data.test_y = normalized_test - - ### Resaves the file under 'self.model_load_filename' to specified output filepath - def resave_net(self, output): - self.model = ecnet.model.multilayer_perceptron() - self.model.load_net(self.model_load_filename) - self.model.save_net(output) - - ### Cleans up the project directory (only keep final node NN's), copies the config, data and normal params (if present) files to the directory, and zips the directory for publication - def publish_project(self): - # Clean up project directory - for build in range(self.vars['project_num_builds']): - for node in range(self.vars['project_num_nodes']): - directory = os.path.join(self.vars['project_name'], os.path.join('build_%d'%(build+1), 'node_%d'%(node+1))) - filelist = [f for f in os.listdir(directory) if 'model_output' in f] - for f in filelist: - os.remove(os.path.join(directory, f)) - # Copy config.yml and normal parameters file to the project directory - save_config(self.vars) - copyfile('config.yml', os.path.join(self.vars['project_name'], 'config.yml')) - if self.vars['normals_use'] is True: - copyfile('./tmp/normal_params.ecnet', os.path.join(self.vars['project_name'], 
'normal_params.ecnet')) - # Export the currently loaded dataset (if loaded) - try: - data_file = open(os.path.join(self.vars['project_name'],'data.d'),'wb') - pickle.dump(self.data, data_file) - data_file.close() - except: - pass - # Zip up the project - zipf = zipfile.ZipFile(self.vars['project_name'] + '.project', 'w', zipfile.ZIP_DEFLATED) + + ''' + Outputs the *results* obtained from "use_model()" to a specified *filename* + ''' + def output_results(self, results, filename = 'my_results.csv'): + + # Output results using data_utils function + ecnet.data_utils.output_results(results, self.DataFrame, filename) + + ''' + Saves the current state of Server (including currently imported DataFrame and configuration), + cleans up the project directory if specified in *clean_up* (only keeps final node models), + and zips up the current state and project directory into a .project file + ''' + def save_project(self, clean_up = True): + + # If removing trials from project directory (keeping final models): + if clean_up: + # For each build + for build in range(self.vars['project_num_builds']): + path_b = os.path.join(self.vars['project_name'], 'build_%d' % build) + # For each node + for node in range(self.vars['project_num_nodes']): + path_n = os.path.join(path_b, 'node_%d' % node) + # Remove trials + trial_files = [file for file in os.listdir(path_n) if 'trial' in file] + for file in trial_files: + os.remove(os.path.join(path_n, file)) + + # Save Server configuration to configuration YAML file + with open(self.config_filename, 'w') as config_file: + yaml.dump(self.vars, config_file, default_flow_style = False, explicit_start = True) + config_file.close() + + # Save currently loaded DataFrame + with open(os.path.join(self.vars['project_name'], 'data.d'), 'wb') as data_file: + pickle.dump(self.DataFrame, data_file) + data_file.close() + + # Zip up all files in project directory, save to .project file + zip_file = zipfile.ZipFile(self.vars['project_name'] + '.project', 'w', zipfile.ZIP_DEFLATED) for root, dirs, files in os.walk(self.vars['project_name']): for file in files: - zipf.write(os.path.join(root,file)) - zipf.close() - - ### Opens a published project, importing model, data, config, normal params - def open_project(self, project_name): - # Check naming scheme - if '.project' not in project_name: - zip_loc = project_name + '.project' + zip_file.write(os.path.join(root, file)) + zip_file.close() + + ''' + Opens a .project file, imports configuration and last used data set, unzips model files + to project directory + ''' + def open_project(self, filename): + + # Check for .project file format + if '.project' not in filename: + filename += '.project' + + # Extract project directory from .project file + zip_file = zipfile.ZipFile(filename, 'r') + zip_file.extractall('./') + zip_file.close() + + # Import project configuration + with open(self.config_filename, 'r') as config_file: + self.vars.update(yaml.load(config_file)) + config_file.close() + + # Import last used DataFrame + with open(os.path.join(self.vars['project_name'], 'data.d'), 'rb') as data_file: + self.DataFrame = pickle.load(data_file) + data_file.close() + + # Package data for model usage + self.packaged_data = self.DataFrame.package_sets() + + # Set project use boolean to True + self.using_project = True + + ''' + PRIVATE METHOD: Helper function for determining data set *dset* to be passed to the model (inputs) + ''' + def __determine_x_vals(self, dset): + + # Use the test set + if dset == 'test': + return self.packaged_data.test_x + 
# Use the validation set + elif dset == 'valid': + return self.packaged_data.valid_x + # Use the learning set + elif dset == 'learn': + return self.packaged_data.learn_x + # Use training set (learning and validation) + elif dset == 'train': + x_vals = [] + for val in self.packaged_data.learn_x: + x_vals.append(val) + for val in self.packaged_data.valid_x: + x_vals.append(val) + return np.asarray(x_vals) + # Use all data sets (learning, validation and testing) else: - project_name = project_name.replace('.project', '') - zip_loc = project_name + '.project' - # Unzip project to directory - zip_ref = zipfile.ZipFile(zip_loc, 'r') - zip_ref.extractall('./') - zip_ref.close() - # Update config to project config - self.vars.update(import_config(filename = os.path.join(project_name,'config.yml'))) - save_config(self.vars) - # Unpack data - try: - self.data = pickle.load(open(os.path.join(project_name, 'data.d'),'rb')) - except: - print('Error: unable to load data.') - pass - # Set up model environment - self.folder_structs_built = True - create_folder_structure(self) - self.model = create_model(self) - - ### Optimizes and tunes the the hyperparameters for ecnet - def tune_hyperparameters(self, target_score = None, iteration_amount = 50, amount_of_employers = 50): - # Check which arguments to use to terminate artifical bee colony, then create the ABC object - if target_score == None: - abc = ecnet.abc.ABC(iterationAmount = iteration_amount, fitnessFunction=runNeuralNet, valueRanges=ecnetValues, amountOfEmployers=amount_of_employers) + x_vals = [] + for val in self.packaged_data.learn_x: + x_vals.append(val) + for val in self.packaged_data.valid_x: + x_vals.append(val) + for val in self.packaged_data.test_x: + x_vals.append(val) + return np.asarray(x_vals) + + ''' + PRIVATE METHOD: Helper function for determining data set *dset* to be passed to the model (targets) + ''' + def __determine_y_vals(self, dset): + + # Use the test set + if dset == 'test': + return self.packaged_data.test_y + # Use the validation set + elif dset == 'valid': + return self.packaged_data.valid_y + # Use the learning set + elif dset == 'learn': + return self.packaged_data.learn_y + # Use training set (learning and validation) + elif dset == 'train': + y_vals = [] + for val in self.packaged_data.learn_y: + y_vals.append(val) + for val in self.packaged_data.valid_y: + y_vals.append(val) + return np.asarray(y_vals) + # Use all data sets (learning, validation and testing) else: - abc = ecnet.abc.ABC(endValue = target_score, fitnessFunction=runNeuralNet, valueRanges=ecnetValues, amountOfEmployers=amount_of_employers) - # Run the artificial bee colony and return the resulting hyperparameter values - hyperparams = abc.runABC() - # Assign the hyperparameters generated from the artificial bee colony to ecnet - self.vars['learning_rate'] = hyperparams[0] - self.vars['valid_mdrmse_stop'] = hyperparams[1] - self.vars['valid_max_epochs'] = hyperparams[2] - self.vars['valid_mdrmse_memory'] = hyperparams[3] - self.vars['mlp_hidden_layers[0][0]'] = hyperparams[4] - self.vars['mlp_hidden_layers[1][0]'] = hyperparams[5] - return hyperparams - -# Creates the default folder structure, outlined in the file config by number of builds and nodes. 
-def create_folder_structure(server_obj): - server_obj.build_dirs = [] - for build_dirs in range(0,server_obj.vars['project_num_builds']): - server_obj.build_dirs.append('build_%d'%(build_dirs + 1)) - server_obj.node_dirs = [] - for build_dirs in range(0,server_obj.vars['project_num_builds']): - local_nodes = [] - for node_dirs in range(0,server_obj.vars['project_num_nodes']): - local_nodes.append('node_%d'%(node_dirs + 1)) - server_obj.node_dirs.append(local_nodes) - for build in range(0,len(server_obj.build_dirs)): - path = os.path.join(server_obj.vars['project_name'], server_obj.build_dirs[build]) - if not os.path.exists(path): - os.makedirs(path) - for node in range(0,len(server_obj.node_dirs[build])): - node_path = os.path.join(path, server_obj.node_dirs[build][node]) - if not os.path.exists(node_path): - os.makedirs(node_path) - server_obj.folder_structs_built = True - -# Used by use_mlp_model to determine which x-values to use for calculations -def determine_x_vals(server, dset): - if dset is 'test': - x_vals = server.data.test_x - elif dset is 'learn': - x_vals = server.data.learn_x - elif dset is 'valid': - x_vals = server.data.valid_x - elif dset is 'train': - x_vals = [] - for i in range(len(server.data.learn_x)): - x_vals.append(list(server.data.learn_x[i])) - for i in range(len(server.data.valid_x)): - x_vals.append(list(server.data.valid_x[i])) - else: - x_vals = server.data.x - return x_vals - -# Used by calc_error to determine which y-values to use for error calculations -def determine_y_vals(server, dset): - if dset is 'test': - y_vals = server.data.test_y - elif dset is 'learn': - y_vals = server.data.learn_y - elif dset is 'valid': - y_vals = server.data.valid_y - elif dset is 'train': - y_vals = [] - for i in range(len(server.data.learn_y)): - y_vals.append(list(server.data.learn_y[i])) - for i in range(len(server.data.valid_y)): - y_vals.append(list(server.data.valid_y[i])) - else: - y_vals = server.data.y - return y_vals - -# Used by calc_errpr to determine which error is being calculated; returns user defined error calculation -def error_fn(arg, y_hat, y): - if arg is 'rmse': - return ecnet.error_utils.calc_rmse(y_hat, y) - elif arg is 'r2': - return ecnet.error_utils.calc_r2(y_hat, y) - elif arg is 'mean_abs_error': - return ecnet.error_utils.calc_mean_abs_error(y_hat, y) - elif arg is 'med_abs_error': - return ecnet.error_utils.calc_med_abs_error(y_hat, y) - else: - print("Error: unknown/unsupported error function: " + arg) - -# Creates a model using config.yaml -def create_model(server_obj): - net = ecnet.model.multilayer_perceptron() - net.addLayer(len(server_obj.data.x[0]), server_obj.vars['mlp_in_layer_activ']) - for hidden in range(0,len(server_obj.vars['mlp_hidden_layers'])): - net.addLayer(server_obj.vars['mlp_hidden_layers'][hidden][0], server_obj.vars['mlp_hidden_layers'][hidden][1]) - net.addLayer(len(server_obj.data.y[0]), server_obj.vars['mlp_out_layer_activ']) - net.connectLayers() - return net - -# Imports 'config.yaml'; creates a default file if none is found -def import_config(filename = 'config.yml'): - try: - stream = open(filename, 'r') - return(yaml.load(stream)) - except: - create_default_config() - stream = open(filename, 'r') - return(yaml.load(stream)) - -# Saves all server variables to config.yml -def save_config(config_dict): - with open('config.yml', 'w') as outfile: - yaml.dump(config_dict, outfile, default_flow_style = False, explicit_start = True) - -# Creates a default 'config.yaml' file -def create_default_config(): - stream = 
open('config.yml', 'w') - config_dict = { - 'data_filename' : 'data.csv', - 'data_sort_type' : 'random', - 'data_split' : [0.65,0.25,0.10], - 'learning_rate' : 0.1, - 'mlp_hidden_layers' : [[5, 'relu'], [5, 'relu']], - 'mlp_in_layer_activ' : 'relu', - 'mlp_out_layer_activ' : 'linear', - 'normals_use' : False, - 'project_name' : 'my_project', - 'project_num_builds' : 1, - 'project_num_nodes' : 1, - 'project_num_trials' : 1, - 'project_print_feedback': True, - 'train_epochs' : 100, - 'valid_max_epochs': 1000, - 'valid_mdrmse_stop' : 0.1, - 'valid_mdrmse_memory' : 1000 - } - yaml.dump(config_dict,stream) - -def runNeuralNet(values): - # Run the ecnet server - config_file = import_config() - sv = Server() - sv.vars['learning_rate'] = values[0] - sv.vars['valid_mdrmse_stop'] = values[1] - sv.vars['valid_max_epochs'] = values[2] - sv.vars['valid_mdrmse_memory'] = values[3] - sv.vars['mlp_hidden_layers[0][0]'] = values[4] - sv.vars['mlp_hidden_layers[1][0]'] = values[5] - sv.vars['data_filename'] = config_file['data_filename'] - - sv.import_data(sv.vars['data_filename']) - sv.fit_mlp_model_validation('shuffle_lv') - test_errors = sv.calc_error('rmse') - sv.publish_project() - return test_errors['rmse'] + y_vals = [] + for val in self.packaged_data.learn_y: + y_vals.append(val) + for val in self.packaged_data.valid_y: + y_vals.append(val) + for val in self.packaged_data.test_y: + y_vals.append(val) + return np.asarray(y_vals) + + ''' + PRIVATE METHOD: Helper function for creating a neural network (multilayer perceptron) + ''' + def __create_mlp_model(self): -ecnetValues = [('float', (0.001, 0.1)), ('float', (0.000001,0.01)), ('int', (1250, 2500)), ('int', (500, 2500)), ('int', (12,32)), ('int', (12,32))] + # Create the model object + mlp_model = ecnet.model.MultilayerPerceptron() + # Add input layer, size = number of data inputs, activation function specified in configuration file + mlp_model.add_layer(self.DataFrame.num_inputs, self.vars['mlp_in_layer_activ']) + # Add hidden layers, sizes and activation functions specified in configuration file + for hidden in range(len(self.vars['mlp_hidden_layers'])): + mlp_model.add_layer(self.vars['mlp_hidden_layers'][hidden][0], self.vars['mlp_hidden_layers'][hidden][1]) + # Add output layer, size = number of data targets, activation function specified in configuration file + mlp_model.add_layer(self.DataFrame.num_targets, self.vars['mlp_out_layer_activ']) + # Connect layers (compute initial weights and biases) + mlp_model.connect_layers() + # Return the model object + return mlp_model + + ''' + PRIVATE METHOD: used to parse error argument, calculate specified error and return it + ''' + def __error_fn(self, arg, y_hat, y): + if arg == 'rmse': + return ecnet.error_utils.calc_rmse(y_hat, y) + elif arg == 'r2': + return ecnet.error_utils.calc_r2(y_hat, y) + elif arg == 'mean_abs_error': + return ecnet.error_utils.calc_mean_abs_error(y_hat, y) + elif arg == 'med_abs_error': + return ecnet.error_utils.calc_med_abs_error(y_hat, y) + else: + raise Exception('ERROR: Unknown/unsupported error function') diff --git a/examples/README.md b/examples/README.md index cd8275e..495b611 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,11 +1,7 @@ # ECNet Examples -### Here are brief descriptions of the example scripts: +### Here are brief descriptions of the example scripts/files: - - **config.yml**: example ECNet configuration file, set up for a cetane number prediction project - - **create_project.py**: creating the project environment, importing data, 
training trial ANN's with validation, selecting the best performing trial for each build node, and publishing the project to a '.project' file
- - **use_project.py**: opening a published project, handing the trained model(s) a new database to predict values for, obtaining and saving the results, and calculating and listing various metrics of error/accuracy regarding the predicted values (if true values are known)
- - **limit_db_parameters.py**: imports a database, reduces the input dimensionality using a "retain the best" algorithm, and saves the reduced database to a specified file
- - **create_static_test_set.py**: Imports a dataset, and creates two files; one containing the test data, one containing the training (learning + validation) data; set sizes are determined by 'data_split' server variable
- - **select_from_test_set_performance.py**: Select best trial from each node using static test set performance
- - **abc_script.py**: Select an optimal set of values given a fitness function and a set of value ranges
+ - **config.yml**: example ECNet configuration file
+ - **limit_input_parameters.py**: example script for reducing the input dimensionality of a supplied database
+ - **tune_hyperparameters.py**: example script for tuning various neural network hyperparameters, such as learning rate
diff --git a/examples/abc_script.py b/examples/abc_script.py
deleted file mode 100644
index 5ee63b2..0000000
--- a/examples/abc_script.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""
-EXAMPLE SCRIPT:
-Find optimal values for a given set of value ranges, and a fitness function
-
-Save the scores of each iteration in a text file called scores.txt
-"""
-
-from ecnet.abc import ABC
-
-# Define a fitness function
-def fitnessTest(values):
-    fit = 0
-    for val in values:
-        fit+=val
-    return fit
-
-# Define a set of value ranges with types attached to them
-values = [('int', (0,100)), ('int', (0,100)), ('int',(0,100)), ('float', (10,1000))]
-
-# Create the abc object
-abc = ABC(fitnessFunction = fitnessTest, amountOfEmployers = 100, valueRanges = values, endValue = 5)
-# Run the colony until the fitness score reaches 5 or less
-abc.runABC()
diff --git a/examples/config.yml b/examples/config.yml
index b917cc5..75504e6 100644
--- a/examples/config.yml
+++ b/examples/config.yml
@@ -1,11 +1,11 @@
---
-data_filename: cn_model_v1.0.csv
+data_filename: my_data.csv
data_sort_type: random
data_split:
-- 0.65
-- 0.25
+- 0.7
+- 0.2
- 0.1
-learning_rate: 0.05
+learning_rate: 0.1
mlp_hidden_layers:
- - 32
 - relu
@@ -14,12 +14,10 @@ mlp_hidden_layers:
mlp_in_layer_activ: relu
mlp_out_layer_activ: linear
normals_use: false
-project_name: cn_v1.0_project
+project_name: my_project
project_num_builds: 1
-project_num_nodes: 1
-project_num_trials: 5
+project_num_nodes: 5
+project_num_trials: 75
project_print_feedback: true
-train_epochs: 2500
-valid_max_epochs: 7500
-valid_mdrmse_memory: 250
-valid_mdrmse_stop: 0.00007
+train_epochs: 1000
+valid_max_epochs: 20000
diff --git a/examples/create_project.py b/examples/create_project.py
deleted file mode 100644
index 6d0a1fb..0000000
--- a/examples/create_project.py
+++ /dev/null
@@ -1,28 +0,0 @@
-"""
-EXAMPLE SCRIPT:
-Project creation and publishing
-
-Creates a machine learning project using Server, creates the project save environment,
-imports the training dataset, creates and fits models, selects the best models for each
-build node, and publishes the project to a '.project' file
-"""
-
-from ecnet.server import Server
-
-# Create the Server
-sv = Server()
-
-# Create project save
environment -sv.create_save_env() - -# Import data from config. file database -sv.import_data() - -# Fits models for each node in each build, shuffling learn and validate sets between trials -sv.fit_mlp_model_validation('shuffle_lv') - -# Selects the best performing model from each build's node -sv.select_best() - -# Saves the project environment to a '.project' file -sv.publish_project() diff --git a/examples/create_static_test_set.py b/examples/create_static_test_set.py deleted file mode 100644 index e36f32a..0000000 --- a/examples/create_static_test_set.py +++ /dev/null @@ -1,26 +0,0 @@ -""" -EXAMPLE SCRIPT: -Creates a static test - -Imports a dataset, and creates two files; one containing the test data, - one containing the training (learning + validation) data. Set sizes - are determined by 'data_split' server variable. -""" - -from ecnet.server import Server -from ecnet import data_utils - -# Create the Server -sv = Server() - -# [learn%, validation%, test%] *** 10% of data will be used for a static test set -sv.vars['data_split'] = [0.7, 0.2, 0.1] - -# Imports database -sv.import_data('original_database.csv') - -# Creates a static test set using randomly imported test proportion -# output files will be: -# original_database_slv.csv -# original_database_st.csv -data_utils.create_static_test_set(sv.data) \ No newline at end of file diff --git a/examples/limit_db_parameters.py b/examples/limit_db_parameters.py deleted file mode 100644 index 89dbfaa..0000000 --- a/examples/limit_db_parameters.py +++ /dev/null @@ -1,18 +0,0 @@ -""" -EXAMPLE SCRIPT: -Reduces database input parameter dimensionality - -Imports a dataset, and reduces the number of input parameters to a specified number based -on input parameter performance -""" - -from ecnet.server import Server - -# Create the Server -sv = Server() - -# Imports config. 
file database
-sv.import_data()
-
-# Limits input dimensionality to 15, saves to specified file
-sv.limit_parameters(15, 'limited_database.csv')
diff --git a/examples/limit_input_parameters.py b/examples/limit_input_parameters.py
new file mode 100644
index 0000000..ab12b84
--- /dev/null
+++ b/examples/limit_input_parameters.py
@@ -0,0 +1,15 @@
+# Import the Server object
+from ecnet.server import Server
+
+# Create Server
+sv = Server()
+
+# Import data (change 'my_data.csv' to your database name)
+sv.import_data('my_data.csv')
+
+# Limit the input dimensionality to 15, save to 'my_data_limited.csv'
+sv.limit_parameters(15, 'my_data_limited.csv')
+
+
+# Use this line instead for limiting the input dimensionality using a genetic algorithm
+sv.limit_parameters(15, 'my_data_limited_genetic.csv', use_genetic = True)
diff --git a/examples/select_from_test_set_performance.py b/examples/select_from_test_set_performance.py
deleted file mode 100644
index c52641b..0000000
--- a/examples/select_from_test_set_performance.py
+++ /dev/null
@@ -1,55 +0,0 @@
-"""
-EXAMPLE SCRIPT:
-Select best trial from each node using static test set performance
-
-This script operates as follows:
-    Train models using training (learning + validation) set
-    Import test set, select best trials using test set performance
-    Obtain results and errors for test set
-    Obtain results and errors for training set
-    Publish project
-"""
-
-from ecnet.server import Server
-
-# create server object
-sv = Server()
-# create the project folder structure
-sv.create_save_env()
-
-# set data split ([learn, validate, test]) and import the training set
-sv.vars['data_split'] = [0.7, 0.3, 0.0]
-sv.import_data('slv_data.csv')
-
-# fit the model to the training set, shuffling learn and validate sets between trials
-sv.fit_mlp_model_validation('shuffle_lv')
-
-# import the test set, select best trials based on test set performance
-sv.import_data('st_data.csv')
-sv.select_best()
-
-# obtain predictions for test set
-test_results = sv.use_mlp_model()
-sv.output_results(test_results, filename = 'test_set_results.csv')
-
-# obtain and print errors for test set
-test_errors = sv.calc_error('rmse', 'r2', 'mean_abs_error', 'med_abs_error')
-print()
-print('Test Errors:')
-print(test_errors)
-print()
-
-# obtain predictions for training set
-sv.import_data('slv_data.csv')
-train_results = sv.use_mlp_model()
-sv.output_results(train_results, filename = 'training_set_results.csv')
-
-# obtain and print errors for training set
-train_errors = sv.calc_error('rmse', 'r2', 'mean_abs_error', 'med_abs_error')
-print()
-print('Training Errors:')
-print(train_errors)
-print()
-
-# publish (i.e. save the state) of current project
-sv.publish_project()
\ No newline at end of file
diff --git a/examples/tune_hyperparameters.py b/examples/tune_hyperparameters.py
new file mode 100644
index 0000000..115f83d
--- /dev/null
+++ b/examples/tune_hyperparameters.py
@@ -0,0 +1,25 @@
+# Import the Server object
+from ecnet.server import Server
+
+# Create Server
+sv = Server()
+
+# Create ECNet project
+sv.create_project('my_project')
+
+# Import data (change 'my_data.csv' to your database name)
+sv.import_data('my_data.csv')
+
+# Tune neural network hyperparameters (learning rate, maximum
+# training epochs during validation, number of neurons in
+# each hidden layer)
+hp = sv.tune_hyperparameters()
+
+# Print the tuned hyperparameters
+print(hp)
+
+# Tuned hyperparameters are now ready to be used!
+sv.train_model('shufflelv', validate = True) + +# Save ECNet project +sv.save_project() \ No newline at end of file diff --git a/examples/use_project.py b/examples/use_project.py deleted file mode 100644 index a7e7ffd..0000000 --- a/examples/use_project.py +++ /dev/null @@ -1,29 +0,0 @@ -""" -EXAMPLE SCRIPT: -Using a pre-existing project to obtain results - -Imports a pre-existing project to the Server environment, imports a testing -dataset, obtain results and errors for testing dataset -""" - -from ecnet.server import Server - - -# Create the Server -sv = Server() - -# Opens pre-existing project -sv.open_project('cn_v1.0_project') - -# Import a new set to predict values for -sv.import_data('testing_data.csv') - -# Grab results (for whole set) -results = sv.use_mlp_model() - -# Save results to file -sv.output_results(results, filename='testing_results.csv') - -# Compute and print errors (for whole set) -errors = sv.calc_error('rmse','r2','mean_abs_error','med_abs_error') -print(errors) diff --git a/misc/README.md b/misc/README.md deleted file mode 100644 index a37d00a..0000000 --- a/misc/README.md +++ /dev/null @@ -1 +0,0 @@ -Miscellaneous files diff --git a/misc/build_figure.png b/misc/build_figure.png deleted file mode 100644 index ec57641..0000000 Binary files a/misc/build_figure.png and /dev/null differ diff --git a/setup.py b/setup.py index 3db94f1..c30d532 100644 --- a/setup.py +++ b/setup.py @@ -1,12 +1,12 @@ from setuptools import setup setup(name = 'ecnet', -version = "1.3.0.dev1", +version = "1.4.0", description = 'UMass Lowell Energy and Combustion Research Laboratory Neural Network Software', url = 'http://github.com/tjkessler/ecnet', author = 'Travis Kessler, Hernan Gelaf-Romer, Sanskriti Sharma', -author_email = 'Travis_Kessler@student.uml.edu, Hernan_Gelafromer@student.uml.edu, Sanskriti_Sharma@student.uml.edu', +author_email = 'travis.j.kessler@gmail.com, Hernan_Gelafromer@student.uml.edu, Sanskriti_Sharma@student.uml.edu', license = 'MIT', packages = ['ecnet'], -install_requires = ["tensorflow","pyyaml", "numpy"], +install_requires = ["tensorflow", "pyyaml", "numpy", "ecabc", "pygenetics"], zip_safe = False)
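For quick reference, the sketch below strings together the Server methods added in this patch (`create_project`, `import_data`, `select_best`, `use_model`, `calc_error`, `output_results`, `save_project`) into a single end-to-end run. It is an illustrative sketch only: the `train_model` call and its `'shufflelv'` argument are assumed from the `examples/tune_hyperparameters.py` script above, `dset = 'test'` is one of the set names handled by the private `__determine_x_vals`/`__determine_y_vals` helpers, and `'my_data.csv'` is a placeholder database name.

```python
# Illustrative end-to-end workflow for the revised Server API
# (train_model name/arguments assumed from examples/tune_hyperparameters.py;
#  file names are placeholders)
from ecnet.server import Server

sv = Server()

# Create the project folder hierarchy (builds -> nodes -> trials)
sv.create_project('my_project')

# Import a database and split it into learning/validation/testing sets
sv.import_data('my_data.csv')

# Train trial networks for each node in each build
sv.train_model('shufflelv', validate = True)

# Keep only the best-performing trial ('final_net') in each node, judged on the test set
sv.select_best(dset = 'test')

# Predict test-set values (averaged across each build's nodes) and save them to a CSV file
results = sv.use_model(dset = 'test')
sv.output_results(results, filename = 'test_results.csv')

# Compute and print test-set error metrics
print(sv.calc_error('rmse', 'r2', 'mean_abs_error', 'med_abs_error', dset = 'test'))

# Archive configuration, data and final models to 'my_project.project'
sv.save_project()
```

A saved archive can later be reloaded with `sv.open_project('my_project')` before importing a new database and calling `use_model` on it.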