Your name: Daniel Kim
Your email: [email protected]
Your company/organization: Twitter
Project name: XGBoost Evaluator Component
Add support for evaluating XGBoost models by extending the standard Evaluator component. Add an example pipeline that trains, evaluates, and pushes an XGBoost model to CAIP.
Component + Example
This project can be used whenever customers wish to evaluate XGBoost models within a TFX pipeline, in order to obtain the various benefits and functionalities that TFX supports.
To make the Evaluator work with XGBoost models, we can customize it by providing a Python module with:
- custom_eval_shared_model() to load model artifacts that are not standard TF models
- custom_extractors() to inject a custom prediction extractor. Similar to [tfma.extractors.PredictExtractor](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/extractors/PredictExtractor), this extractor uses a Beam PTransform to load the model and extract predictions.
Option 1 (chosen): working with the XGBoost library directly
The XGBoost library provides a few different ways to save a model (an xgb.Booster or xgb.sklearn.XGBModel object). Backward compatibility is guaranteed in most cases. Currently, the 2 main supported formats are:
- XGBoost internal binary format. Note that auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using the binary format.
- JSON: a newer format intended to replace the binary format
For maximum compatibility, we want the Trainer component to write both formats to the expected output directory. We can provide a helper function that takes a Booster object and writes model.bin and model.json to that directory.
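A minimal sketch of such a helper, assuming the library chooses the serialization format from the file extension of the path passed to save_model (a .json path yields JSON, other paths yield the binary format). The name write_model_formats and the _FakeBooster stand-in are hypothetical; the stand-in only lets the sketch run without xgboost installed.

```python
import os
import tempfile


def write_model_formats(booster, output_dir):
    """Save `booster` as both model.json and model.bin under `output_dir`.

    `booster` only needs a `save_model(path)` method, as xgb.Booster has.
    Assumption: the library picks the format from the file extension.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = {
        "json": os.path.join(output_dir, "model.json"),
        "binary": os.path.join(output_dir, "model.bin"),
    }
    for path in paths.values():
        booster.save_model(path)
    return paths


class _FakeBooster:
    """Hypothetical stand-in for xgb.Booster; writes a placeholder file."""

    def save_model(self, path):
        with open(path, "w") as f:
            f.write("placeholder")


demo_dir = tempfile.mkdtemp()
written = write_model_formats(_FakeBooster(), demo_dir)
```

In a real Trainer, the booster returned by training would be passed in place of the stand-in, and output_dir would be the model artifact directory that the Trainer is given.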
The Evaluator uses the latest version of the xgboost library to read model.json; this will be implemented in the UDF custom_eval_shared_model(). This way, we can expect the loaded model object to retain most of the necessary information.
Option 2: using sklearn Pipeline
import xgboost as xgb
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler
# params, x_train and y_train are assumed to be defined by the caller
classifier = xgb.XGBClassifier(**params)
model = SkPipeline([
    ('scaler', StandardScaler()),
    ('classifier', classifier),
])
model.fit(x_train, y_train)
# you can choose to save just the XGBClassifier:
model.steps[1][1].save_model(...)
However, it’s more likely that you’ll need the whole sklearn Pipeline in downstream evaluation. There are 2 methods to save and load a sklearn Pipeline object:
Using joblib:
import joblib
# model is the fitted sklearn Pipeline
joblib.dump(model, 'model.joblib')
Note that CAIP asks users to use sklearn.externals.joblib rather than the bare joblib, but newer versions of sklearn have deprecated sklearn.externals.
Using pickle:
import pickle
# model is the fitted sklearn Pipeline
with open('model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
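Saving is only half of the round trip; the Evaluator side has to load the file again. The sketch below shows both directions with pickle. The pipeline_state list is a hypothetical stand-in for a fitted sklearn Pipeline, used so the example runs without sklearn installed; a real Pipeline would be dumped and loaded the same way, provided sklearn and xgboost are importable in the loading process.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted sklearn Pipeline: (step name, settings).
pipeline_state = [
    ("scaler", {"with_mean": True}),
    ("classifier", {"max_depth": 3}),
]

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save side (Trainer).
with open(model_path, "wb") as model_file:
    pickle.dump(pipeline_state, model_file)

# Load side (Evaluator). Only unpickle files from a trusted source,
# because unpickling can execute arbitrary code.
with open(model_path, "rb") as model_file:
    restored = pickle.load(model_file)
```

The loaded object is a copy that compares equal to the original, not the same instance.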
The main downside of working with a sklearn Pipeline is the potential loss of portability; this is discussed further in the next section.
The Evaluator component will also use a custom prediction extractor, which loads and runs our EvalSharedModel(s) on the given examples. XGBoost models cannot accept tf.Examples as input, so the examples have to be converted within the extractor.
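The conversion step can be sketched as follows. This is an illustration under stated assumptions rather than the planned implementation: it assumes each tf.Example has already been decoded into a dict that maps a feature name to a list of values, and the function name examples_to_rows and the penguin feature names are illustrative. Missing features become NaN, which XGBoost treats as a missing value by default.

```python
import math


def examples_to_rows(examples, feature_names):
    """Turn per-example feature dicts into dense rows in a fixed column order.

    `examples` is an assumed decoded form of tf.Example records: each one
    maps a feature name to a list of values. Missing or empty features
    become NaN.
    """
    rows = []
    for example in examples:
        row = []
        for name in feature_names:
            values = example.get(name) or []
            row.append(float(values[0]) if values else math.nan)
        rows.append(row)
    return rows


feature_names = ["culmen_length_mm", "body_mass_g"]
examples = [
    {"culmen_length_mm": [39.1], "body_mass_g": [3750.0]},
    {"culmen_length_mm": [46.5]},  # body_mass_g is missing
]
rows = examples_to_rows(examples, feature_names)
```

Fixing the column order through feature_names matters because the model expects features in the same order it saw during training.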
Our custom prediction extractor governs converting the data to formats that XGBoost can accept, extracting the necessary features, running the actual prediction, and the framework code supporting all of these operations. It will be passed into the [tfma.default_extractors](https://tensorflow.google.cn/tfx/model_analysis/api_docs/python/tfma/default_extractors) function for use in the Evaluator.
Currently, we plan to support running the Evaluator with Apache Beam by using a custom prediction DoFn that loads the model, processes the inputs, and runs predictions, together with a simple pipeline wrapper that applies this DoFn to extracts.
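The load-once, predict-per-element pattern that such a DoFn would follow can be sketched without importing Beam. The class below mirrors the DoFn lifecycle (setup() once per worker, process() per element) but is plain Python. The names PredictionFn and _FakeModel, and the extract keys "features" and "predictions", are assumptions for illustration; a real implementation would subclass beam.DoFn and load an actual XGBoost model.

```python
import copy


class PredictionFn:
    """Mirrors the shape of a Beam DoFn: setup() once, process() per element."""

    def __init__(self, model_loader, feature_names):
        self._model_loader = model_loader  # called lazily, not at construction
        self._feature_names = feature_names
        self._model = None

    def setup(self):
        # Load the model once so it is reused across all elements.
        self._model = self._model_loader()

    def process(self, extracts):
        features = extracts["features"]
        row = [features[name] for name in self._feature_names]
        result = copy.copy(extracts)  # leave the incoming extracts untouched
        result["predictions"] = self._model.predict([row])[0]
        yield result


class _FakeModel:
    """Hypothetical stand-in for a loaded XGBoost model."""

    def predict(self, rows):
        return [sum(row) for row in rows]


fn = PredictionFn(model_loader=_FakeModel, feature_names=["a", "b"])
fn.setup()
incoming = {"features": {"a": 1.0, "b": 2.0}}
outputs = list(fn.process(incoming))
```

Passing a loader callable instead of a loaded model keeps the object cheap to serialize, which is why the model is only created in setup().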
The actual implementation of the custom prediction extractor depends on whether it receives a native XGBoost serialized model (option 1 from above), or a pickled sklearn Pipeline (option 2 from above). Here are pros and cons of each option.
Option 1 (chosen):
- Pros:
  - Universal among the various XGBoost interfaces (Python, JVM, C++, etc.)
  - Some level of backward compatibility is guaranteed
  - Retains attributes such as feature_names, feature_types, etc. (in newer xgboost versions)
- Cons:
  - Lacks the ability to combine with or substitute in other sklearn models
Option 2:
- Pros:
  - Another wrapping layer means more flexibility: you can add pre-processing and post-processing to the sklearn Pipeline, try out other types of models, etc.
  - Most of the code needed for the sklearn-compatible Trainer and Evaluator in the penguin sklearn pipeline can be reused
- Cons:
  - Extra dependency on sklearn
  - Relies on the Python pickle standard library or joblib, both of which are specific to Python
  - No guarantee of backward compatibility
Summary: we will go with option 1 for simplicity and broader compatibility across different XGBoost interfaces.
Open questions: From a training performance viewpoint, is there a difference between using native xgboost and using a sklearn Pipeline?
In the same spirit as https://github.com/tensorflow/tfx-addons/blob/main/proposals/20210404-sklearn_example.md, we will add an example pipeline that runs locally, and another version that runs on GCP using Vertex AI Pipelines.
The example pipeline will have an end-to-end local unit test.
The model can be pushed to CAIP. The current CAIP runtime version 2.5 runs XGBoost 1.4.0. Training, serializing, and deserializing XGBoost models with different versions of the library is allowed; in other words, the XGBoost library guarantees some level of backward compatibility.
This example pipeline will not be packaged; instead, users just need to clone the source code to run the example.
- tfx_addons/xgboost_evaluator (evaluator code and tests)
- examples/xgboost_penguins (example pipelines and tests)
xgboost>=1.4.0
Daniel Kim, kindalime, [email protected]
Vincent Nguyen, cent5, [email protected]