Some problems in model development #28

Open
Echo9573 opened this issue Nov 19, 2019 · 3 comments
Echo9573 commented Nov 19, 2019

Hi, I ran into some issues while trying to develop SQLFlow models:

  • Analysts usually manipulate data with a Pandas DataFrame and feed it directly into a Keras model, which is convenient for debugging. SQLFlow's tf-codegen uses a tf.data.Dataset instead, which adds extra learning cost (a minimal conversion sketch follows this list).
  • Connecting with SQLFlow is troublesome. For models hosted under the SQLFlow models repo, debugging locally means writing your own train.py, including reading data, defining feature columns, and so on. However, the locally written train.py and the train.py generated by SQLFlow do not always behave consistently.
  • An analysis task usually consists of feature engineering -> data preprocessing -> model training (or prediction). At present, the model zoo covers only the last step, but sharing a model between teams really means sharing the entire data-processing chain. I hope SQLFlow will also support custom data preprocessing and include it in the design of the model zoo.
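For reference, a minimal sketch of the one-call DataFrame-to-Dataset conversion discussed in the first bullet. The column names, label column, and batch size are illustrative assumptions, not SQLFlow's actual tf-codegen output.

```python
# Minimal sketch (assumed example data, not SQLFlow codegen): turn a Pandas
# DataFrame with a label column into a tf.data.Dataset for a Keras model.
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "class": [0, 0, 2],
})

labels = df.pop("class")
# One call produces a dataset of (features_dict, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels)).batch(32)

for features, label in dataset.take(1):
    print(features["sepal_length"], label)
```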
Yancey1989 commented:

@Echo9573 Regarding the second point, I have a PR about the Couler function: https://github.com/sql-machine-learning/sqlflow/pull/1208/files, which introduces a way to run a custom model on the host using the SQLFlow submitter Docker image. Please take a look.

tonyyang-svail commented Nov 19, 2019

Hi @Echo9573, thanks for submitting this issue.

  1. Converting a Pandas data frame to a TensorFlow dataset can be done with a single function call (as in the conversion sketch earlier in this thread), so the conversion should be straightforward. Also, I am wondering why it is hard to debug using the dataset: in both data frame and dataset modes, we can set a breakpoint in the mymodel.call function to debug.

  2. I really appreciate your interest in integrating newly contributed models into SQLFlow. However, I am hesitant to add SQLFlow-specific logic to this repo, since the model definitions (this repo) and the runtime engine (SQLFlow, EDL, etc.) should be decoupled. In terms of testing the models, may I ask what kind of models you are trying to contribute?

    1. If the models are like DNNClassifier, which contains only the two functions __init__ and call, I don't think we need to write a standalone train.py to test them. You can test such a model using tests/base.py, as tests/test_dnnclassifier.py does (a rough sketch of a model in this shape follows this list).
    2. If the models are like DeepEmbeddingClusterModel, which wraps special training logic, then we can work together to figure out a standard training API that both the models repo and the sqlflow repo should follow. We would materialize that standard as a base testing class like tests/base.py. If all model tests derived from that base class pass, SQLFlow should guarantee that its generated train.py will also pass.

    To sum up, models should be tested using tests/base.py; there is no need to write a train.py just to test models. SQLFlow's train.py is developed based on tests/base.py, and SQLFlow should be responsible for integrating the models.

  3. I totally agree that sharing a model only works if the feature engineering is shared along with it. Is it possible to represent the feature engineering as several SQL statements and share them along with the select ... to train some_model_from_model_zoo ... statement?
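To make point 2.i concrete, below is a rough, hypothetical sketch of a model in that shape: only __init__ and call, so a shared harness such as tests/base.py could compile and fit it on a tf.data.Dataset without a per-model train.py. The class name, layer sizes, and feature_columns argument are assumptions for illustration, not the repo's actual DNNClassifier.

```python
# Hypothetical sketch (not the repo's actual DNNClassifier): a Keras model
# defining only __init__ and call, suitable for a shared test harness.
# Layer sizes and the feature_columns argument are illustrative assumptions.
import tensorflow as tf


class MyDNNClassifier(tf.keras.Model):
    def __init__(self, feature_columns, hidden_units=(16, 16), n_classes=3):
        super().__init__()
        # DenseFeatures maps a dict of feature tensors to one dense tensor.
        self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
        self.hidden_layers = [
            tf.keras.layers.Dense(units, activation="relu") for units in hidden_units
        ]
        self.prediction_layer = tf.keras.layers.Dense(n_classes, activation="softmax")

    def call(self, inputs):
        x = self.feature_layer(inputs)
        for layer in self.hidden_layers:
            x = layer(x)
        return self.prediction_layer(x)
```

A harness would presumably construct the model with a list of tf.feature_column definitions, then call compile() and fit() on a dataset of (features_dict, label) batches; the exact interface is whatever tests/base.py standardizes.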

typhoonzero commented:

@Echo9573 Doing data pre-processing with SQL is currently in the design phase; please take a look at sql-machine-learning/elasticdl#1477. We would appreciate your comments and advice.

Echo9573 added the DataScience (some issue about the application in data science) and DiDi (the issue publisher is from DiDi) labels on Dec 2, 2019.