[Question] How to get the pre-processed data used by auto-sklearn to train a model? #1700

Open
rraadd88 opened this issue Oct 7, 2023 · 0 comments

I would like to get the pre-processed data that was used to train a model.

How did this question come about?

The preprocessed data could be used, for example, to calculate summary statistics and compare them with those of the untransformed data, or of data preprocessed with different methods.

Would a small code snippet help?

This question applies to standard usage of the AutoSklearnClassifier class, following the example given in the docs.
Here's a snippet anyway:

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")
# Get the configuration for one model/run (here: the first entry in the run history)
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]
config = automl.automl_.runhistory_.ids_config[run_key.config_id]
print(config)
# Output:
Configuration(values={
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'gradient_boosting',
  'classifier:gradient_boosting:early_stop': 'off',
  'classifier:gradient_boosting:l2_regularization': 0.5536468700597662,
  'classifier:gradient_boosting:learning_rate': 0.023910336277631047,
  'classifier:gradient_boosting:loss': 'auto',
  'classifier:gradient_boosting:max_bins': 255,
  'classifier:gradient_boosting:max_depth': 'None',
  'classifier:gradient_boosting:max_leaf_nodes': 12,
  'classifier:gradient_boosting:min_samples_leaf': 4,
  'classifier:gradient_boosting:scoring': 'loss',
  'classifier:gradient_boosting:tol': 1e-07,
  'data_preprocessor:__choice__': 'feature_type',
  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
  'feature_preprocessor:__choice__': 'pca',
})

What have you already looked at?

I have already looked at

  1. The Documentation, Examples and Issues, but I could not find any direct solution there.
  2. I looked in the smac3-output and .auto-sklearn folders inside tmp_folder, but could not find any files containing the preprocessed data, or any information I could use to reconstruct it.
  3. I tried filtering the configuration and passing it to the AutoSklearnPreprocessingAlgorithm class, but I repeatedly got NotImplementedError.
  4. I tried building a custom sklearn Pipeline from auto-sklearn's own pre-processing components, e.g. rescaling from the data_preprocessing module, but I found that this approach was not directly compatible with the configuration requirements of auto-sklearn.
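
For reference, the manual approach from item 4 can be approximated with plain scikit-learn instead of auto-sklearn's components. This is only a rough sketch: it assumes the three steps named in the printed configuration (most-frequent imputation, standardization, PCA) and uses scikit-learn's default hyperparameters, which are not necessarily the values auto-sklearn chose. The step names and variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hand-built stand-in for the preprocessing named in the configuration.
# ASSUMPTION: scikit-learn defaults are used for every hyperparameter;
# this does not reproduce what auto-sklearn actually fitted.
manual_preprocessing = Pipeline(
    steps=[
        ("imputation", SimpleImputer(strategy="most_frequent")),
        ("rescaling", StandardScaler()),
        ("feature_preprocessor", PCA()),
    ]
)

# Fit on the training split only, then reuse the fitted steps on the test split.
X_train_preprocessed = manual_preprocessing.fit_transform(X_train)
X_test_preprocessed = manual_preprocessing.transform(X_test)

# Compare summary statistics before and after preprocessing.
print("raw mean (first 3 columns):         ", np.round(X_train.mean(axis=0)[:3], 3))
print("preprocessed mean (first 3 columns):", np.round(X_train_preprocessed.mean(axis=0)[:3], 3))
print("preprocessed std (first 3 columns): ", np.round(X_train_preprocessed.std(axis=0)[:3], 3))
```

This gives transformed data to compute statistics on, but it is not tied to any particular auto-sklearn run, which is the gap this question is about.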

Suggestion

A couple of functions could be implemented to (1) filter the configuration for a fitted model to keep only the keys related to the pre-processing steps, and then (2) run the corresponding steps to get the preprocessed data. For example, the code could look like this:

# Note: dummy code (proposed API, not implemented)
import autosklearn.preprocessing

# Filtered configuration
config_preprocessing = autosklearn.preprocessing.get_config_preprocessing(config)
print(config_preprocessing)
# Configuration(values={
#   'data_preprocessor:__choice__': 'feature_type',
#   'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
#   'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
#   'feature_preprocessor:__choice__': 'pca',
# })

# Get the preprocessed data
X_preprocessed = autosklearn.preprocessing.fit_transform(
    X=X,
    configuration=config_preprocessing,
)

This is just a suggestion. If there is any other way of obtaining the pre-processed data, please let me know.
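
Step (1) of the suggestion can be sketched without any new auto-sklearn API, by filtering on the key prefixes that appear in the printed configuration. The helper `filter_preprocessing_config` is hypothetical (it is not part of auto-sklearn), and the sketch assumes the configuration's values have already been copied into a plain dict; how to do that depends on the ConfigSpace version in use.

```python
from typing import Any, Dict, Mapping, Tuple

# Key prefixes treated as "preprocessing". ASSUMPTION: taken from the
# configuration printed earlier in this issue, not from auto-sklearn's docs.
PREPROCESSING_PREFIXES: Tuple[str, ...] = (
    "data_preprocessor:",
    "feature_preprocessor:",
)


def filter_preprocessing_config(values: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only keys that start with a preprocessing prefix."""
    return {
        key: value
        for key, value in values.items()
        if key.startswith(PREPROCESSING_PREFIXES)
    }


# A subset of the values from the printed configuration, as a plain dict.
example_values = {
    "balancing:strategy": "weighting",
    "classifier:__choice__": "gradient_boosting",
    "classifier:gradient_boosting:learning_rate": 0.023910336277631047,
    "data_preprocessor:__choice__": "feature_type",
    "data_preprocessor:feature_type:numerical_transformer:imputation:strategy": "most_frequent",
    "data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__": "standardize",
    "feature_preprocessor:__choice__": "pca",
}

preprocessing_values = filter_preprocessing_config(example_values)
for key, value in preprocessing_values.items():
    print(f"{key}: {value}")
```

Step (2), running the corresponding fitted steps on the data, is the part that still needs support from auto-sklearn itself.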

System Details (if relevant)

  • Version of auto-sklearn: 0.15.0
  • Running this on Linux.