Some problems in model development #28

Open
Echo9573 opened this issue Nov 19, 2019 · 3 comments
Echo9573 commented Nov 19, 2019

Hi, I ran into some issues while trying to develop SQLFlow models:

  • Analysts usually manipulate data with a Pandas DataFrame and feed it directly into a Keras model, which is convenient for debugging. SQLFlow's tf-codegen uses a tf.data.Dataset instead, which adds extra learning cost (a minimal conversion sketch follows this list).
  • Connecting with SQLFlow is troublesome. For models hosted under the SQLFlow models repo, debugging locally means writing your own train.py, including reading data, defining feature columns, and so on. However, the locally written train.py and the train.py generated by SQLFlow do not always behave consistently.
  • An analysis task usually consists of feature engineering -> data preprocessing -> model training (or prediction). At present, the model zoo covers only the last step, but sharing a model between teams really means sharing the entire data-processing chain. I hope SQLFlow will also support custom data preprocessing and include it in the design of the model zoo.
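For reference, a minimal sketch of the one-call DataFrame-to-Dataset conversion discussed in the first bullet. The column names, label column, and batch size are illustrative assumptions, not SQLFlow's actual tf-codegen output.

```python
# Minimal sketch (assumed example data, not SQLFlow codegen): turn a Pandas
# DataFrame with a label column into a tf.data.Dataset for a Keras model.
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "class": [0, 0, 2],
})

labels = df.pop("class")
# One call produces a dataset of (features_dict, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels)).batch(32)

for features, label in dataset.take(1):
    print(features["sepal_length"], label)
```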
Yancey1989 commented:

@Echo9573 Regarding the second point, I have a PR about the Couler function: https://github.com/sql-machine-learning/sqlflow/pull/1208/files, which introduces a way to run a custom model on the host using the SQLFlow submitter Docker image. Please take a look.

tonyyang-svail commented Nov 19, 2019

Hi @Echo9573, thanks for submitting this issue.

  1. Converting a Pandas data frame to a TensorFlow dataset can be done with a single function call (as in the conversion sketch earlier in this thread), so the conversion should be straightforward. Also, I am wondering why it is hard to debug using the dataset: in both data frame and dataset modes, we can set a breakpoint in the mymodel.call function to debug.

  2. I really appreciate your interest in integrating newly contributed models into SQLFlow. However, I am hesitant to add SQLFlow-specific logic to this repo, since the model definitions (this repo) and the runtime engine (SQLFlow, EDL, etc.) should be decoupled. In terms of testing the models, may I ask what kind of models you are trying to contribute?

    1. If the models are like DNNClassifier, which contains only the two functions __init__ and call, I don't think we need to write a standalone train.py to test them. You can test such a model using tests/base.py, as tests/test_dnnclassifier.py does (a rough sketch of a model in this shape follows this list).
    2. If the models are like DeepEmbeddingClusterModel, which wraps special training logic, then we can work together to figure out a standard training API that both the models repo and the sqlflow repo should follow. We would materialize that standard as a base testing class like tests/base.py. If all model tests derived from that base class pass, SQLFlow should guarantee that its generated train.py will also pass.

    To sum up, models should be tested using tests/base.py; there is no need to write a train.py just to test models. SQLFlow's train.py is developed based on tests/base.py, and SQLFlow should be responsible for integrating the models.

  3. I totally agree that sharing a model only works if the feature engineering is shared along with it. Is it possible to represent the feature engineering as several SQL statements and share them along with the select ... to train some_model_from_model_zoo ... statement?
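To make point 2.i concrete, below is a rough, hypothetical sketch of a model in that shape: only __init__ and call, so a shared harness such as tests/base.py could compile and fit it on a tf.data.Dataset without a per-model train.py. The class name, layer sizes, and feature_columns argument are assumptions for illustration, not the repo's actual DNNClassifier.

```python
# Hypothetical sketch (not the repo's actual DNNClassifier): a Keras model
# defining only __init__ and call, suitable for a shared test harness.
# Layer sizes and the feature_columns argument are illustrative assumptions.
import tensorflow as tf


class MyDNNClassifier(tf.keras.Model):
    def __init__(self, feature_columns, hidden_units=(16, 16), n_classes=3):
        super().__init__()
        # DenseFeatures maps a dict of feature tensors to one dense tensor.
        self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
        self.hidden_layers = [
            tf.keras.layers.Dense(units, activation="relu") for units in hidden_units
        ]
        self.prediction_layer = tf.keras.layers.Dense(n_classes, activation="softmax")

    def call(self, inputs):
        x = self.feature_layer(inputs)
        for layer in self.hidden_layers:
            x = layer(x)
        return self.prediction_layer(x)
```

A harness would presumably construct the model with a list of tf.feature_column definitions, then call compile() and fit() on a dataset of (features_dict, label) batches; the exact interface is whatever tests/base.py standardizes.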

typhoonzero commented:

@Echo9573 Doing data pre-processing with SQL is currently in the design phase; please take a look at sql-machine-learning/elasticdl#1477. We would appreciate your comments and advice.

Echo9573 added the DataScience (some issue about the application in data science) and DiDi (the issue publisher is from DiDi) labels on Dec 2, 2019.