<Back to Table of Contents

Inside MindsDB

Different transactions (PREDICT, CREATE MODEL, etc.) require different steps/phases, but they may share some of these phases. To keep the process modular, the variables held in the Transaction controller (the data bus) act as the communication interface between phases: the implementation of a given phase can change, so long as the expected variables on the bus prevail. (We will describe some of the Phase Modules in more detail in the next section.)
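As a rough illustration of this pattern, the sketch below shows phases that communicate only through attributes of a shared transaction object; the class, phase and attribute names are hypothetical assumptions for the sake of the example, not MindsDB's actual API.

```python
# Minimal sketch of the "data bus" pattern described above.
# Class, phase and attribute names are illustrative, not MindsDB's actual API.

class Transaction:
    """Shared data bus: phases communicate only through its attributes."""
    def __init__(self, query):
        self.query = query
        self.input_data = None    # set by the data-extraction phase
        self.column_stats = None  # set by the stats phase
        self.predictions = None   # set by the prediction phase


class DataExtractorPhase:
    def run(self, transaction):
        # Any implementation is acceptable as long as it leaves
        # `input_data` on the bus for the phases that follow.
        transaction.input_data = []  # placeholder for rows pulled from the data sources


class StatsGeneratorPhase:
    def run(self, transaction):
        # Reads `input_data`, writes `column_stats` back to the bus.
        transaction.column_stats = {}


def run_transaction(query, phases):
    """A transaction is just an ordered run of phases over the same bus."""
    transaction = Transaction(query)
    for phase in phases:
        phase.run(transaction)
    return transaction


bus = run_transaction("SELECT ...", [DataExtractorPhase(), StatsGeneratorPhase()])
```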

  • DataExtractor: It deals with taking a query and pulling the data from the various data sources implied in the query, building the joins (if any) and loading the full result into memory. Note that, as of now, MindsDB requires the full dataset to fit in memory; to add flexibility, we currently support Apache Drill as a data aggregator.

  • StatsGenerator: Once the data is pulled and aggregated from the various data sources, MindsDB runs an analysis of each column of the corpus: its distribution and statistical properties, the counts, etc. This information is used by later phases; the goal is to run as much analysis on the data as possible up front so that later phases don't have to repeat computations over the dataset or sub-dataset they are working on. One example is normalization: to compute $normalized(j) = (j - mean) / range$ we need the mean and range of the column that $j$ belongs to, and recalculating them for every single value would be too expensive computationally, so calculating those values once and storing them in the transaction BUS makes things efficient (see the normalization sketch at the end of this document).

  • StatsLoader: There are some transactions, such as PREDICT, where it is assumed that the statistical information is already known; all we have to do is make sure we load the right statistics into the transaction BUS.

  • DataVectorizer: In this phase the idea is to translate each column into numpy tensors that can be ingested by data models. Tensors are made of vector representations of each cell. This involves understanding what transformations are necessary depending on the column type. Currently the following column data types are supported (note that these can be expanded or updated for various needs; a sketch of these type-detection rules appears at the end of this document):

    • Categorical:

      • text-tokens: columns where the values are TEXT, the distribution of the text is made of words or combinations of words, and the number of unique values does not exceed 10% of the total number of rows.

      • numeric-tokens: columns where the values are NUMERIC and the number of unique values does not exceed 10% of the total number of rows.

    • Continuous:

      • numeric: These are NUMERIC values that don't match the criteria of numeric-tokens or date-time.

      • date-time: These are values that are in fact timestamps, as flagged by the datastore, or TEXT recognized as a datetime string, and thus can be converted into a timestamp.

    • Sequential:

      These are TEXT values that do not fit the text-tokens classification, or lists/arrays of values. Their vector representation is the last hidden state of an encoder (see next section).

  • DataEncoder: This step aims to reduce the dimensionality of each vector representation, as well as produce an encoded state for sequential data, so that in further steps all column tensors can, if desired, be concatenated and passed as input to a model (a sketch of both encoders appears at the end of this document).

    • RNNEncoder: It is used to encode sequential data using the last hidden state of a recurrent neural network. This encoder has as hyper-parameters the type of recurrent neural network to be used as well as the size of the hidden state $h_N$. The topology we currently use is a GRU.

    • FullyConnectedNetEncoder: This tries to reduce the dimensionality of the input using two fully connected layers: assuming that one input row belongs to $\mathbb{R}^N$, the middle layer has $N$ neurons and the second layer maps into $\mathbb{R}^M$, where $M$ is the dimension of the output/target.

  • ModelTrainer: The Model Trainer uses the tensor representations of the columns and instantiates Train and Test Samplers (a Sampler allows fetching data from the column tensors in batches and looping over epochs; it provides an abstraction that is independent from the ML framework). It also instantiates, trains and validates various model constructs; essentially, models are coded in MindsDB as meta-models. MindsDB ships with some general meta-models, but advanced users can add any meta-model they want, as long as it is coded in either pytorch or tensorflow. The structure of the resulting meta-models is dependent on the Sampler input and output structures, and each also has a flexible number of configurations/hyper-parameters.

    • FullyConnectedNet: This takes as input the concatenation of all the input tensors, which in turn are the outputs of the encoders for each column; assuming that there are $M$ columns in the input $I$, and that the output $O \in \mathbb{R}^U$, then $I \in \mathbb{R}^{U \times M}$. Another hyper-parameter is the number of layers, which can be any of $\{3, 6, 9\}$, where the middle layer has a dimensionality in $\mathbb{R}^{U/2}$.

    • EnsembleConvNet: This architecture is an ensemble in which each input is connected to a fully connected layer whose output has the same size as the target ($\in \mathbb{R}^U$). Since each ensemble net has the same output size, the concatenation of the ensemble outputs is fed into a stack of convolutional layers, on the assumption that those convolutions can learn features from the ensembles; the depth can be any of $\{2, 4, 6\}$ and the number of filters per layer can be any value in $\{N/2, \ldots, N\}$, $N$ being the number of columns in the input. The stack is finally followed by a fully connected layer with a linear output. What is key here is the loss: denote the output of each ensemble member as $O_i$, $i \in \{1, \ldots, N\}$, and the output of the final fully connected layer as $O_{net}$. Assuming a loss function $f(O_{model}, O_{real})$ such as $RMSE$, the loss for the full network is defined as $net_{loss} = f(O_{net}, O_{real}) + \frac{1}{N}\sum_{i=1}^{N} f(O_i, O_{real})$ (a sketch of this loss appears at the end of this document).

    • EnsembleFullyConnectedNet: This architecture is similar to the EnsembleConvNet, with the exception that it has no convolutional layers: from the ensemble it goes straight to a fully connected stack. The calculation of the loss is the same as described for the EnsembleConvNet.

  • ModelPredictor: The model predictor is called when the transaction is a PREDICT transaction. It loads the model with the highest $R^2$; the lookup for available models is based on the columns in the input and output, and it will look for models whose column names and data types match in the same order. Once the predictions are done, it places the predicted values in an output tensor (which is a copy of the input tensor).

  • DataDeVectorizer: Once the output data is ready and updated with the predictions, it proceeds to denormalize each vector that corresponds to a cell and produces a list of lists that contains the output, which will be taken by the proxy and returned to the client as if the data existed in the data store. Unless specified otherwise, it also adds a column for confidence, which is pulled from the training stats of the model; from those stats we can estimate $P(O_{predicted}=O_{real})$, which we report as the confidence of the individual prediction.
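As referenced in the StatsGenerator description above, the following is a minimal sketch of precomputing per-column statistics once and reusing them for normalization; the function names are hypothetical, not MindsDB's actual code.

```python
# Sketch of why StatsGenerator precomputes column statistics: the mean and
# range are calculated once and cached on the bus, so normalization never
# rescans the column. Function names are hypothetical, not MindsDB's code.

def compute_column_stats(values):
    """Computed once per column and stored on the transaction bus."""
    mean = sum(values) / len(values)
    value_range = (max(values) - min(values)) or 1.0  # avoid division by zero
    return {"mean": mean, "range": value_range}

def normalize(j, stats):
    """normalized(j) = (j - mean) / range, using the cached statistics."""
    return (j - stats["mean"]) / stats["range"]

column = [3.0, 7.0, 11.0, 9.0]
stats = compute_column_stats(column)                # done once, kept on the bus
normalized = [normalize(j, stats) for j in column]  # cheap per-value work
```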
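Similarly, here is a simplified sketch of the column-type rules described under DataVectorizer (in particular the 10%-unique threshold for the token types). The helper below is a hypothetical, much cruder stand-in for MindsDB's actual type detection.

```python
# Simplified sketch of the column-type rules above: the 10%-unique threshold
# for the token types, and datetime parsing for TEXT columns.
from datetime import datetime

def detect_column_type(values):
    n_rows = len(values)
    n_unique = len(set(values))

    if all(isinstance(v, (int, float)) for v in values):
        # numeric-tokens: NUMERIC values with <= 10% unique values
        return "numeric-tokens" if n_unique <= 0.1 * n_rows else "numeric"

    try:
        # date-time: TEXT that parses as a datetime string
        for v in values:
            datetime.fromisoformat(v)
        return "date-time"
    except (TypeError, ValueError):
        pass

    # text-tokens if the text has <= 10% unique values, otherwise sequential
    return "text-tokens" if n_unique <= 0.1 * n_rows else "sequential"
```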
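The two encoders described under DataEncoder could look roughly like this in pytorch; module names, sizes and defaults are illustrative assumptions, not MindsDB's implementation.

```python
# Sketch of the two encoders described under DataEncoder, written in pytorch.
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    """Encodes a sequence as the last hidden state h_N of a GRU."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):        # x: (batch, seq_len, input_size)
        _, h_n = self.gru(x)     # h_n: (1, batch, hidden_size)
        return h_n.squeeze(0)    # (batch, hidden_size)

class FullyConnectedNetEncoder(nn.Module):
    """Reduces an input row in R^N to R^M with two fully connected layers."""
    def __init__(self, n, m):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n, n),     # middle layer with N neurons
            nn.ReLU(),
            nn.Linear(n, m),     # projection to the output dimension M
        )

    def forward(self, x):        # x: (batch, n)
        return self.net(x)

# Example: 8 sequences of length 20 with 16 features -> 32-dim encodings
encoded = RNNEncoder(input_size=16, hidden_size=32)(torch.randn(8, 20, 16))
```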
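Finally, a sketch of the combined loss used by the ensemble architectures, $net_{loss} = f(O_{net}, O_{real}) + \frac{1}{N}\sum_{i=1}^{N} f(O_i, O_{real})$, with RMSE as $f$; again this is illustrative, not MindsDB's actual training code.

```python
# Sketch of the ensemble loss defined above, with RMSE as f:
# net_loss = f(O_net, O_real) + (1/N) * sum_i f(O_i, O_real)
import torch
import torch.nn.functional as F

def rmse(pred, target):
    return torch.sqrt(F.mse_loss(pred, target))

def ensemble_loss(o_net, ensemble_outputs, o_real, f=rmse):
    n = len(ensemble_outputs)
    member_losses = sum(f(o_i, o_real) for o_i in ensemble_outputs)
    return f(o_net, o_real) + member_losses / n
```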