Hi, I ran into some issues while developing SQLFlow models:
Analysts usually use a pandas DataFrame to manipulate data and feed it into a Keras model, which is convenient for debugging. SQLFlow's tf-codegen, however, uses tf.data.Dataset, which imposes an additional learning cost.
Connecting to SQLFlow is troublesome. For models written for SQLFlow, if you want to debug locally you have to implement a train.py yourself, including reading data, defining feature columns, and so on. However, a hand-written train.py and the train.py generated by SQLFlow do not always behave consistently.
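For concreteness, here is a rough sketch of the boilerplate such a hand-written train.py involves. The file name, column names, and model architecture below are assumptions for illustration, not taken from SQLFlow's generated code:

```python
import pandas as pd
import tensorflow as tf

# Read data -- in SQLFlow this step is generated from the SELECT statement.
df = pd.read_csv("train.csv")  # assumed local export of the training table
labels = df.pop("label")

# Define feature columns by hand -- SQLFlow derives these from its COLUMN clause.
feature_columns = [tf.feature_column.numeric_column(name) for name in df.columns]

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels)).batch(32)
model.fit(dataset, epochs=5)
```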
An analysis task usually consists of feature engineering -> data preprocessing -> model training (or prediction). At present, the model zoo covers only the last step, but sharing a model is only useful if the entire data-processing chain is shared along with it. I hope SQLFlow will also support custom data preprocessing and include it in the design of the model zoo.
Converting a pandas DataFrame to a TensorFlow dataset can be done with a single function call, so the conversion should be straightforward. Also, I am wondering why it is hard to debug using the dataset: in both DataFrame and dataset mode, we can set a breakpoint in the mymodel.call function to debug.
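For reference, the one-call conversion looks like this (the column names here are just an example):

```python
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({"age": [25, 32, 47], "income": [50.0, 64.5, 80.0], "label": [0, 1, 1]})
labels = df.pop("label")

# One call turns the DataFrame into a tf.data.Dataset of (features, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
dataset = dataset.shuffle(len(df)).batch(2)

for features, label in dataset.take(1):
    print(features["age"], label)
```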
I really appreciate your consideration of integrating newly contributed models into SQLFlow. However, I am hesitant to add SQLFlow-specific logic to this repo, because the model definitions (this repo) and the runtime engines (SQLFlow, EDL, etc.) should stay decoupled. Regarding testing the models, may I ask what kind of models you are trying to contribute?
If the models are like DNNClassifier, which contains only the two functions __init__ and call, I don't think we need a standalone train.py to test them. You can test them using tests/base.py, as tests/test_dnnclassifier.py does.
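A minimal sketch of a model in that shape (the class name and layer sizes below are illustrative, not from the repo):

```python
import tensorflow as tf

class MyDNNClassifier(tf.keras.Model):  # illustrative name, not from the repo
    def __init__(self, feature_columns, hidden_units=(16, 8), n_classes=2):
        super().__init__()
        self.dense_features = tf.keras.layers.DenseFeatures(feature_columns)
        self.hidden = [tf.keras.layers.Dense(u, activation="relu") for u in hidden_units]
        self.out = tf.keras.layers.Dense(n_classes, activation="softmax")

    def call(self, inputs):
        # Map raw feature dict -> dense tensor, then apply the hidden stack.
        x = self.dense_features(inputs)
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)
```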
If the models are like DeepEmbeddingClusterModel, which wraps special training logic, then we can work together to figure out a standard training API that both the models repo and the sqlflow repo should follow. We would materialize that standard into a base testing class like tests/base.py. If all model tests derived from that base class pass, SQLFlow should guarantee that its train.py also passes.
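As a purely hypothetical illustration of what such a base testing class could look like (every name below is an assumption; the actual tests/base.py API may differ):

```python
import unittest
import tensorflow as tf

class BaseTestCases:
    class BaseModelTest(unittest.TestCase):
        """Hypothetical shared base: subclasses only need to set self.model."""

        def setUp(self):
            # Tiny synthetic dataset so every model runs the same training loop.
            features = {"x": tf.constant([[0.0], [1.0], [2.0], [3.0]])}
            labels = tf.constant([0, 1, 0, 1])
            self.dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

        def test_train(self):
            self.model.compile(optimizer="adam",
                               loss="sparse_categorical_crossentropy")
            history = self.model.fit(self.dataset, epochs=1, verbose=0)
            self.assertIn("loss", history.history)

class TestToyModel(BaseTestCases.BaseModelTest):
    def setUp(self):
        super().setUp()
        # Any model that obeys the standard training API plugs in here.
        feature_columns = [tf.feature_column.numeric_column("x")]
        self.model = tf.keras.Sequential([
            tf.keras.layers.DenseFeatures(feature_columns),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])

if __name__ == "__main__":
    unittest.main()
```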
To sum up, models should be tested using tests/base.py; there is no need to write a train.py to test them. SQLFlow's train.py is developed against tests/base.py, and SQLFlow is responsible for integrating the models.
I fully agree that sharing a model only works if the feature engineering is shared along with it. Is it possible to represent the feature engineering as a few SQL statements and share them along with the select ... to train some_model_from_model_zoo ... statement?
@Echo9573 Doing data preprocessing using SQL is currently in the design phase; please take a look at sql-machine-learning/elasticdl#1477, and we would appreciate your comments and advice.