[Question] How to get the pre-processed data used by `auto-sklearn` to train a model? #1700

rraadd88 · 2023-10-07T22:59:44Z

I would like to get the pre-processed data that was used to train a model.

How did this question come about?

The preprocessed data could be used to, for example, calculate its summary statistics and then compare with the un-transformed data or with the data preprocessed with different methods.
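
To illustrate the kind of comparison I mean, here is a minimal sketch using only the Python standard library. The feature values are made up, and `standardize` and `summarize` are hypothetical helpers, not auto-sklearn functions. The standardization here only imitates what a "standardize" rescaling step would presumably do to a single numerical feature.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def summarize(values):
    """Collect a few summary statistics for one feature column."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up values for a single numerical feature (illustration only).
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
preprocessed = standardize(raw)

raw_summary = summarize(raw)
preprocessed_summary = summarize(preprocessed)

# Print the statistics side by side: un-transformed vs. preprocessed.
for name in raw_summary:
    print(f"{name}: raw={raw_summary[name]:.3f} preprocessed={preprocessed_summary[name]:.3f}")
```

With the actual preprocessed data from auto-sklearn, the same comparison could be repeated per feature and per preprocessing method.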

Would a small code snippet help?

This question applies to a standard use of the AutoSklearnClassifier class, based on the example given in the docs.
Here's a snippet anyway:

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")
# Get the configuration of one model/run (here: the first entry in the run history)
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]
config = automl.automl_.runhistory_.ids_config[run_key.config_id]
print(config)

Configuration(values={
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'gradient_boosting',
  'classifier:gradient_boosting:early_stop': 'off',
  'classifier:gradient_boosting:l2_regularization': 0.5536468700597662,
  'classifier:gradient_boosting:learning_rate': 0.023910336277631047,
  'classifier:gradient_boosting:loss': 'auto',
  'classifier:gradient_boosting:max_bins': 255,
  'classifier:gradient_boosting:max_depth': 'None',
  'classifier:gradient_boosting:max_leaf_nodes': 12,
  'classifier:gradient_boosting:min_samples_leaf': 4,
  'classifier:gradient_boosting:scoring': 'loss',
  'classifier:gradient_boosting:tol': 1e-07,
  'data_preprocessor:__choice__': 'feature_type',
  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
  'feature_preprocessor:__choice__': 'pca',
})

What have you already looked at?

I have already looked at the following:

The documentation, examples and issues, but I could not find a direct solution there.
I looked into the smac3-output and .auto-sklearn folders inside the tmp_folder, but could not find any files containing the preprocessed data or information I could use to reconstruct it.
I tried filtering the configuration and passing it to AutoSklearnPreprocessingAlgorithm, which repeatedly raised "not implemented" errors.
I tried building a custom sklearn Pipeline from auto-sklearn's own preprocessing components (e.g. rescaling from the data_preprocessing module), but I found that this approach was not directly compatible with the configuration requirements of auto-sklearn.

Suggestion

A couple of functions could be implemented to (1) filter the configuration for a fitted model to keep only the keys related to the pre-processing steps, and then (2) run the corresponding steps to get the preprocessed data. For example, the code could look like this:

# Note: dummy code; the functions below do not exist in auto-sklearn (yet)
import autosklearn.preprocessing

# (1) Filter the configuration down to the pre-processing keys
config_preprocessing = autosklearn.preprocessing.get_config_preprocessing(config)
print(config_preprocessing)
# Configuration(values={
#   'data_preprocessor:__choice__': 'feature_type',
#   'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
#   'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
#   'feature_preprocessor:__choice__': 'pca',
# })

# (2) Run the corresponding steps to get the preprocessed data
X_preprocessed = autosklearn.preprocessing.fit_transform(
    X=X,
    configuration=config_preprocessing,
)

This is just a suggestion. If there is any other way of obtaining the pre-processed data, please let me know.
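
Step (1) of the suggestion above could be approximated today with a plain dictionary filter. The following is only a sketch: `filter_preprocessing_config` and `PREPROCESSING_PREFIXES` are hypothetical names, not part of auto-sklearn, and it assumes the configuration has first been converted to a plain dict (I believe ConfigSpace's `Configuration.get_dictionary()` does this, but I have not verified it for this version). It also assumes that only keys starting with `data_preprocessor:` or `feature_preprocessor:` count as pre-processing, which matches the expected output shown above.

```python
# Hypothetical helper; auto-sklearn does not provide this function.
# Assumption: these two key prefixes identify the pre-processing steps.
PREPROCESSING_PREFIXES = ("data_preprocessor:", "feature_preprocessor:")


def filter_preprocessing_config(config_values):
    """Keep only the entries whose keys belong to a pre-processing step."""
    return {
        key: value
        for key, value in config_values.items()
        if key.startswith(PREPROCESSING_PREFIXES)
    }


# A shortened copy of the configuration printed earlier, as a plain dict.
config_values = {
    "balancing:strategy": "weighting",
    "classifier:__choice__": "gradient_boosting",
    "classifier:gradient_boosting:learning_rate": 0.023910336277631047,
    "data_preprocessor:__choice__": "feature_type",
    "data_preprocessor:feature_type:numerical_transformer:imputation:strategy": "most_frequent",
    "data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__": "standardize",
    "feature_preprocessor:__choice__": "pca",
}

config_preprocessing = filter_preprocessing_config(config_values)
for key, value in config_preprocessing.items():
    print(f"{key} = {value}")
```

This covers only the filtering. Step (2), actually running the selected steps on the data, is the part I could not find a way to do.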

System Details (if relevant)

Version of auto-sklearn: 0.15.0
Running this on Linux.
