diff --git a/AUTHORS.rst b/AUTHORS.rst index e0000ac..703e073 100644 --- a/AUTHORS.rst +++ b/AUTHORS.rst @@ -16,6 +16,10 @@ Authors * Sayan Patra * Yi Su * Rachit Arora +* Brian Vegetabile +* Qiang Fei +* Phil Gaudreau +* Yi-Wei Liu Other Contributors ------------------ diff --git a/HISTORY.rst b/HISTORY.rst index 202287f..84d0b79 100644 --- a/HISTORY.rst +++ b/HISTORY.rst @@ -2,6 +2,31 @@ History ======= +0.5.0 (2023-04-03) +------------------ + +Python 3.10 support. + +* New features and methods + * Improvements on modeling holidays + * @Yi Su: Added ``HolidayGrouper``. Holidays with a similar effect are grouped together to have fewer and more robust coefficients. With this feature, holidays are better modeled with improved forecast accuracy. + * @Kaixu Yang: Added support for holiday neighboring impact for data with frequency longer than daily through the ``daily_event_neighbor_impact`` parameter (e.g. this enables modeling holidays on weekly data where the event dates may not fall on the exact timestamps); added holiday neighboring events (i.e. the lags of an actual holiday can be specified in the model) through the ``daily_event_shifted_effect`` parameter. + * @Yi Su: Added holiday indicators. Now users can specify "is_event_exact", "is_event_adjacent", "is_event" (a union of both) as ``extra_pred_cols`` in the model. + * @Reza Hosseini: Added DST indicators. Now users can specify "us_dst" or "eu_dst" in ``extra_pred_cols``. You may also use ``get_us_dst_start/end``, ``get_eu_dst_start/end`` functions to get the dates. + * @Yi Su: Theoretical improvements for the volatility model in the linear and ridge algorithms for more accurate variance estimates and prediction intervals. + * @Phil Gaudreau: Added new evaluation metric: ``mean_interval_score``. + * @Brian Vegetabile: Enhanced components plot that consolidates previous forecast breakdown functionality. The redesign provides a cleaner visual and allows for flexible breakdowns via regular expressions. 
+ +* Library enhancements + * @Kaixu Yang: Python 3.10 support. Deprecated support for lower Python versions. + * @Sayan Patra: New utility function: ``get_exploratory_plots`` to easily generate exploratory data analysis (EDA) plots in HTML. + * @Kaixu Yang: Added ``optimize_mape`` option to quantile regression. It uses 1 over y as weights in the loss function. + +* Bug fixes + * @Qiang Fei: In case of simulation, now ``min_admissible_value`` and ``max_admissible_value`` will correctly cap the simulated values. Additionally, errors are propagated through simulation steps to make the intervals more accurate. + * @Yi Su, @Sayan Patra: Now ``train_end_date`` is always respected if specified by the user. Previously it was ignored if there were trailing NAs in the training data or if ``anomaly_df`` imputed the anomalous points to NA. Also, now ``train_end_date`` accepts a string value. + * @Yi Su: The seasonality order now takes `None` without raising an error. It will be treated the same as `False` or zero. + 0.4.0 (2022-07-15) ------------------ diff --git a/README.rst b/README.rst index cd11b51..d7699ad 100644 --- a/README.rst +++ b/README.rst @@ -1,167 +1,167 @@ -Greykite: A flexible, intuitive and fast forecasting library - -.. raw:: html - -

- -

- -Why Greykite? -------------- - -The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite. - -Silverkite algorithm works well on most time series, and is especially adept for those with changepoints in trend or seasonality, -event/holiday effects, and temporal dependencies. -Its forecasts are interpretable and therefore useful for trusted decision-making and insights. - -The Greykite library provides a framework that makes it easy to develop a good forecast model, -with exploratory data analysis, outlier/anomaly preprocessing, feature extraction and engineering, grid search, -evaluation, benchmarking, and plotting. -Other open source algorithms can be supported through Greykite’s interface to take advantage of this framework, -as listed below. - -For a demo, please see our `quickstart `_. - -Distinguishing Features ------------------------ - -* Flexible design - * Provides time series regressors to capture trend, seasonality, holidays, - changepoints, and autoregression, and lets you add your own. - * Fits the forecast using a machine learning model of your choice. -* Intuitive interface - * Provides powerful plotting tools to explore seasonality, interactions, changepoints, etc. - * Provides model templates (default parameters) that work well based on - data characteristics and forecast requirements (e.g. daily long-term forecast). - * Produces interpretable output, with model summary to examine individual regressors, - and component plots to visually inspect the combined effect of related regressors. -* Fast training and scoring - * Facilitates interactive prototyping, grid search, and benchmarking. - Grid search is useful for model selection and semi-automatic forecasting of multiple metrics. -* Extensible framework - * Exposes multiple forecast algorithms in the same interface, - making it easy to try algorithms from different libraries and compare results. 
- * The same pipeline provides preprocessing, cross-validation, - backtest, forecast, and evaluation with any algorithm. - -Algorithms currently supported within Greykite’s modeling framework: - -* Silverkite (Greykite’s flagship algorithm) -* `Facebook Prophet `_ -* `Auto Arima `_ - -Notable Components ------------------- - -Greykite offers components that could be used within other forecasting -libraries or even outside the forecasting context. - -* ModelSummary() - R-like summaries of `scikit-learn` and `statsmodels` regression models. -* ChangepointDetector() - changepoint detection based on adaptive lasso, with visualization. -* SimpleSilverkiteForecast() - Silverkite algorithm with `forecast_simple` and `predict` methods. -* SilverkiteForecast() - low-level interface to Silverkite algorithm with `forecast` and `predict` methods. -* ReconcileAdditiveForecasts() - adjust a set of forecasts to satisfy inter-forecast additivity constraints. - -Usage Examples --------------- - -You can obtain forecasts with only a few lines of code: - -.. 
code-block:: python - - from greykite.common.data_loader import DataLoader - from greykite.framework.templates.autogen.forecast_config import ForecastConfig - from greykite.framework.templates.autogen.forecast_config import MetadataParam - from greykite.framework.templates.forecaster import Forecaster - from greykite.framework.templates.model_templates import ModelTemplateEnum - - # Defines inputs - df = DataLoader().load_bikesharing().tail(24*90) # Input time series (pandas.DataFrame) - config = ForecastConfig( - metadata_param=MetadataParam(time_col="ts", value_col="count"), # Column names in `df` - model_template=ModelTemplateEnum.AUTO.name, # AUTO model configuration - forecast_horizon=24, # Forecasts 24 steps ahead - coverage=0.95, # 95% prediction intervals - ) - - # Creates forecasts - forecaster = Forecaster() - result = forecaster.run_forecast_config(df=df, config=config) - - # Accesses results - result.forecast # Forecast with metrics, diagnostics - result.backtest # Backtest with metrics, diagnostics - result.grid_search # Time series CV result - result.model # Trained model - result.timeseries # Processed time series with plotting functions - -For a demo, please see our `quickstart `_. - -Setup and Installation ----------------------- - -Greykite is available on Pypi and can be installed with pip: - -.. code-block:: - - pip install greykite - -For more installation tips, see `installation `_. - -Documentation -------------- - -Please find our full documentation `here `_. - -Learn More ----------- - -* `Website `_ -* `Paper `_ (KDD '22 Best Paper Runner-up, Applied Data Science Track) -* `Blog post `_ - -Citation --------- - -Please cite Greykite in your publications if it helps your research: - -.. 
code-block:: - - @misc{reza2021greykite-github, - author = {Reza Hosseini and - Albert Chen and - Kaixu Yang and - Sayan Patra and - Yi Su and - Rachit Arora}, - title = {Greykite: a flexible, intuitive and fast forecasting library}, - url = {https://github.com/linkedin/greykite}, - year = {2021} - } - -.. code-block:: - - @inproceedings{reza2022greykite-kdd, - author = {Hosseini, Reza and Chen, Albert and Yang, Kaixu and Patra, Sayan and Su, Yi and Al Orjany, Saad Eddin and Tang, Sishi and Ahammad, Parvez}, - title = {Greykite: Deploying Flexible Forecasting at Scale at LinkedIn}, - year = {2022}, - isbn = {9781450393850}, - publisher = {Association for Computing Machinery}, - address = {New York, NY, USA}, - url = {https://doi.org/10.1145/3534678.3539165}, - doi = {10.1145/3534678.3539165}, - booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, - pages = {3007–3017}, - numpages = {11}, - keywords = {forecasting, scalability, interpretable machine learning, time series}, - location = {Washington DC, USA}, - series = {KDD '22} - } - - -License -------- - -Copyright (c) LinkedIn Corporation. All rights reserved. Licensed under the +Greykite: A flexible, intuitive and fast forecasting library + +.. raw:: html + +

+ +

+ +Why Greykite? +------------- + +The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite. + +Silverkite algorithm works well on most time series, and is especially adept for those with changepoints in trend or seasonality, +event/holiday effects, and temporal dependencies. +Its forecasts are interpretable and therefore useful for trusted decision-making and insights. + +The Greykite library provides a framework that makes it easy to develop a good forecast model, +with exploratory data analysis, outlier/anomaly preprocessing, feature extraction and engineering, grid search, +evaluation, benchmarking, and plotting. +Other open source algorithms can be supported through Greykite’s interface to take advantage of this framework, +as listed below. + +For a demo, please see our `quickstart `_. + +Distinguishing Features +----------------------- + +* Flexible design + * Provides time series regressors to capture trend, seasonality, holidays, + changepoints, and autoregression, and lets you add your own. + * Fits the forecast using a machine learning model of your choice. +* Intuitive interface + * Provides powerful plotting tools to explore seasonality, interactions, changepoints, etc. + * Provides model templates (default parameters) that work well based on + data characteristics and forecast requirements (e.g. daily long-term forecast). + * Produces interpretable output, with model summary to examine individual regressors, + and component plots to visually inspect the combined effect of related regressors. +* Fast training and scoring + * Facilitates interactive prototyping, grid search, and benchmarking. + Grid search is useful for model selection and semi-automatic forecasting of multiple metrics. +* Extensible framework + * Exposes multiple forecast algorithms in the same interface, + making it easy to try algorithms from different libraries and compare results. 
+ * The same pipeline provides preprocessing, cross-validation, + backtest, forecast, and evaluation with any algorithm. + +Algorithms currently supported within Greykite’s modeling framework: + +* Silverkite (Greykite’s flagship algorithm) +* `Facebook Prophet `_ +* `Auto Arima `_ + +Notable Components +------------------ + +Greykite offers components that could be used within other forecasting +libraries or even outside the forecasting context. + +* ModelSummary() - R-like summaries of `scikit-learn` and `statsmodels` regression models. +* ChangepointDetector() - changepoint detection based on adaptive lasso, with visualization. +* SimpleSilverkiteForecast() - Silverkite algorithm with `forecast_simple` and `predict` methods. +* SilverkiteForecast() - low-level interface to Silverkite algorithm with `forecast` and `predict` methods. +* ReconcileAdditiveForecasts() - adjust a set of forecasts to satisfy inter-forecast additivity constraints. + +Usage Examples +-------------- + +You can obtain forecasts with only a few lines of code: + +.. 
code-block:: python + + from greykite.common.data_loader import DataLoader + from greykite.framework.templates.autogen.forecast_config import ForecastConfig + from greykite.framework.templates.autogen.forecast_config import MetadataParam + from greykite.framework.templates.forecaster import Forecaster + from greykite.framework.templates.model_templates import ModelTemplateEnum + + # Defines inputs + df = DataLoader().load_bikesharing().tail(24*90) # Input time series (pandas.DataFrame) + config = ForecastConfig( + metadata_param=MetadataParam(time_col="ts", value_col="count"), # Column names in `df` + model_template=ModelTemplateEnum.AUTO.name, # AUTO model configuration + forecast_horizon=24, # Forecasts 24 steps ahead + coverage=0.95, # 95% prediction intervals + ) + + # Creates forecasts + forecaster = Forecaster() + result = forecaster.run_forecast_config(df=df, config=config) + + # Accesses results + result.forecast # Forecast with metrics, diagnostics + result.backtest # Backtest with metrics, diagnostics + result.grid_search # Time series CV result + result.model # Trained model + result.timeseries # Processed time series with plotting functions + +For a demo, please see our `quickstart `_. + +Setup and Installation +---------------------- + +Greykite is available on Pypi and can be installed with pip: + +.. code-block:: + + pip install greykite + +For more installation tips, see `installation `_. + +Documentation +------------- + +Please find our full documentation `here `_. + +Learn More +---------- + +* `Website `_ +* `Paper `_ (KDD '22 Best Paper Runner-up, Applied Data Science Track) +* `Blog post `_ + +Citation +-------- + +Please cite Greykite in your publications if it helps your research: + +.. 
code-block:: + + @misc{reza2021greykite-github, + author = {Reza Hosseini and + Albert Chen and + Kaixu Yang and + Sayan Patra and + Yi Su and + Rachit Arora}, + title = {Greykite: a flexible, intuitive and fast forecasting library}, + url = {https://github.com/linkedin/greykite}, + year = {2021} + } + +.. code-block:: + + @inproceedings{reza2022greykite-kdd, + author = {Hosseini, Reza and Chen, Albert and Yang, Kaixu and Patra, Sayan and Su, Yi and Al Orjany, Saad Eddin and Tang, Sishi and Ahammad, Parvez}, + title = {Greykite: Deploying Flexible Forecasting at Scale at LinkedIn}, + year = {2022}, + isbn = {9781450393850}, + publisher = {Association for Computing Machinery}, + address = {New York, NY, USA}, + url = {https://doi.org/10.1145/3534678.3539165}, + doi = {10.1145/3534678.3539165}, + booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, + pages = {3007–3017}, + numpages = {11}, + keywords = {forecasting, scalability, interpretable machine learning, time series}, + location = {Washington DC, USA}, + series = {KDD '22} + } + + +License +------- + +Copyright (c) LinkedIn Corporation. All rights reserved. Licensed under the `BSD 2-Clause `_ License. \ No newline at end of file diff --git a/README_PYPI.rst b/README_PYPI.rst index 11d0b0c..78cb5c0 100644 --- a/README_PYPI.rst +++ b/README_PYPI.rst @@ -1,168 +1,168 @@ -Greykite: A flexible, intuitive and fast forecasting library - -.. image:: https://raw.githubusercontent.com/linkedin/greykite/master/LOGO-C8.png - :height: 300px - :width: 450px - :scale: 80% - :alt: Greykite - :align: center - -Why Greykite? -------------- - -The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite. - -Silverkite algorithm works well on most time series, and is especially adept for those with changepoints in trend or seasonality, -event/holiday effects, and temporal dependencies. 
-Its forecasts are interpretable and therefore useful for trusted decision-making and insights. - -The Greykite library provides a framework that makes it easy to develop a good forecast model, -with exploratory data analysis, outlier/anomaly preprocessing, feature extraction and engineering, grid search, -evaluation, benchmarking, and plotting. -Other open source algorithms can be supported through Greykite’s interface to take advantage of this framework, -as listed below. - -For a demo, please see our `quickstart `_. - -Distinguishing Features ------------------------ - -* Flexible design - * Provides time series regressors to capture trend, seasonality, holidays, - changepoints, and autoregression, and lets you add your own. - * Fits the forecast using a machine learning model of your choice. -* Intuitive interface - * Provides powerful plotting tools to explore seasonality, interactions, changepoints, etc. - * Provides model templates (default parameters) that work well based on - data characteristics and forecast requirements (e.g. daily long-term forecast). - * Produces interpretable output, with model summary to examine individual regressors, - and component plots to visually inspect the combined effect of related regressors. -* Fast training and scoring - * Facilitates interactive prototyping, grid search, and benchmarking. - Grid search is useful for model selection and semi-automatic forecasting of multiple metrics. -* Extensible framework - * Exposes multiple forecast algorithms in the same interface, - making it easy to try algorithms from different libraries and compare results. - * The same pipeline provides preprocessing, cross-validation, - backtest, forecast, and evaluation with any algorithm. 
- -Algorithms currently supported within Greykite’s modeling framework: - -* Silverkite (Greykite’s flagship algorithm) -* `Facebook Prophet `_ -* `Auto Arima `_ - -Notable Components ------------------- - -Greykite offers components that could be used within other forecasting -libraries or even outside the forecasting context. - -* ModelSummary() - R-like summaries of `scikit-learn` and `statsmodels` regression models. -* ChangepointDetector() - changepoint detection based on adaptive lasso, with visualization. -* SimpleSilverkiteForecast() - Silverkite algorithm with `forecast_simple` and `predict` methods. -* SilverkiteForecast() - low-level interface to Silverkite algorithm with `forecast` and `predict` methods. -* ReconcileAdditiveForecasts() - adjust a set of forecasts to satisfy inter-forecast additivity constraints. - -Usage Examples --------------- - -You can obtain forecasts with only a few lines of code: - -.. code-block:: python - - from greykite.common.data_loader import DataLoader - from greykite.framework.templates.autogen.forecast_config import ForecastConfig - from greykite.framework.templates.autogen.forecast_config import MetadataParam - from greykite.framework.templates.forecaster import Forecaster - from greykite.framework.templates.model_templates import ModelTemplateEnum - - # Defines inputs - df = DataLoader().load_bikesharing().tail(24*90) # Input time series (pandas.DataFrame) - config = ForecastConfig( - metadata_param=MetadataParam(time_col="ts", value_col="count"), # Column names in `df` - model_template=ModelTemplateEnum.AUTO.name, # AUTO model configuration - forecast_horizon=24, # Forecasts 24 steps ahead - coverage=0.95, # 95% prediction intervals - ) - - # Creates forecasts - forecaster = Forecaster() - result = forecaster.run_forecast_config(df=df, config=config) - - # Accesses results - result.forecast # Forecast with metrics, diagnostics - result.backtest # Backtest with metrics, diagnostics - result.grid_search # Time series CV 
result - result.model # Trained model - result.timeseries # Processed time series with plotting functions - -For a demo, please see our `quickstart `_. - -Setup and Installation ----------------------- - -Greykite is available on Pypi and can be installed with pip: - -.. code-block:: - - pip install greykite - -For more installation tips, see `installation `_. - -Documentation -------------- - -Please find our full documentation `here `_. - -Learn More ----------- - -* `Website `_ -* `Paper `_ (KDD '22 Best Paper Runner-up, Applied Data Science Track) -* `Blog post `_ - -Citation --------- - -Please cite Greykite in your publications if it helps your research: - -.. code-block:: - - @misc{reza2021greykite-github, - author = {Reza Hosseini and - Albert Chen and - Kaixu Yang and - Sayan Patra and - Yi Su and - Rachit Arora}, - title = {Greykite: a flexible, intuitive and fast forecasting library}, - url = {https://github.com/linkedin/greykite}, - year = {2021} - } - -.. code-block:: - - @inproceedings{reza2022greykite-kdd, - author = {Hosseini, Reza and Chen, Albert and Yang, Kaixu and Patra, Sayan and Su, Yi and Al Orjany, Saad Eddin and Tang, Sishi and Ahammad, Parvez}, - title = {Greykite: Deploying Flexible Forecasting at Scale at LinkedIn}, - year = {2022}, - isbn = {9781450393850}, - publisher = {Association for Computing Machinery}, - address = {New York, NY, USA}, - url = {https://doi.org/10.1145/3534678.3539165}, - doi = {10.1145/3534678.3539165}, - booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, - pages = {3007–3017}, - numpages = {11}, - keywords = {forecasting, scalability, interpretable machine learning, time series}, - location = {Washington DC, USA}, - series = {KDD '22} - } - - -License -------- - -Copyright (c) LinkedIn Corporation. All rights reserved. Licensed under the +Greykite: A flexible, intuitive and fast forecasting library + +.. 
image:: https://raw.githubusercontent.com/linkedin/greykite/master/LOGO-C8.png + :height: 300px + :width: 450px + :scale: 80% + :alt: Greykite + :align: center + +Why Greykite? +------------- + +The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite. + +Silverkite algorithm works well on most time series, and is especially adept for those with changepoints in trend or seasonality, +event/holiday effects, and temporal dependencies. +Its forecasts are interpretable and therefore useful for trusted decision-making and insights. + +The Greykite library provides a framework that makes it easy to develop a good forecast model, +with exploratory data analysis, outlier/anomaly preprocessing, feature extraction and engineering, grid search, +evaluation, benchmarking, and plotting. +Other open source algorithms can be supported through Greykite’s interface to take advantage of this framework, +as listed below. + +For a demo, please see our `quickstart `_. + +Distinguishing Features +----------------------- + +* Flexible design + * Provides time series regressors to capture trend, seasonality, holidays, + changepoints, and autoregression, and lets you add your own. + * Fits the forecast using a machine learning model of your choice. +* Intuitive interface + * Provides powerful plotting tools to explore seasonality, interactions, changepoints, etc. + * Provides model templates (default parameters) that work well based on + data characteristics and forecast requirements (e.g. daily long-term forecast). + * Produces interpretable output, with model summary to examine individual regressors, + and component plots to visually inspect the combined effect of related regressors. +* Fast training and scoring + * Facilitates interactive prototyping, grid search, and benchmarking. + Grid search is useful for model selection and semi-automatic forecasting of multiple metrics. 
+* Extensible framework + * Exposes multiple forecast algorithms in the same interface, + making it easy to try algorithms from different libraries and compare results. + * The same pipeline provides preprocessing, cross-validation, + backtest, forecast, and evaluation with any algorithm. + +Algorithms currently supported within Greykite’s modeling framework: + +* Silverkite (Greykite’s flagship algorithm) +* `Facebook Prophet `_ +* `Auto Arima `_ + +Notable Components +------------------ + +Greykite offers components that could be used within other forecasting +libraries or even outside the forecasting context. + +* ModelSummary() - R-like summaries of `scikit-learn` and `statsmodels` regression models. +* ChangepointDetector() - changepoint detection based on adaptive lasso, with visualization. +* SimpleSilverkiteForecast() - Silverkite algorithm with `forecast_simple` and `predict` methods. +* SilverkiteForecast() - low-level interface to Silverkite algorithm with `forecast` and `predict` methods. +* ReconcileAdditiveForecasts() - adjust a set of forecasts to satisfy inter-forecast additivity constraints. + +Usage Examples +-------------- + +You can obtain forecasts with only a few lines of code: + +.. 
code-block:: python + + from greykite.common.data_loader import DataLoader + from greykite.framework.templates.autogen.forecast_config import ForecastConfig + from greykite.framework.templates.autogen.forecast_config import MetadataParam + from greykite.framework.templates.forecaster import Forecaster + from greykite.framework.templates.model_templates import ModelTemplateEnum + + # Defines inputs + df = DataLoader().load_bikesharing().tail(24*90) # Input time series (pandas.DataFrame) + config = ForecastConfig( + metadata_param=MetadataParam(time_col="ts", value_col="count"), # Column names in `df` + model_template=ModelTemplateEnum.AUTO.name, # AUTO model configuration + forecast_horizon=24, # Forecasts 24 steps ahead + coverage=0.95, # 95% prediction intervals + ) + + # Creates forecasts + forecaster = Forecaster() + result = forecaster.run_forecast_config(df=df, config=config) + + # Accesses results + result.forecast # Forecast with metrics, diagnostics + result.backtest # Backtest with metrics, diagnostics + result.grid_search # Time series CV result + result.model # Trained model + result.timeseries # Processed time series with plotting functions + +For a demo, please see our `quickstart `_. + +Setup and Installation +---------------------- + +Greykite is available on Pypi and can be installed with pip: + +.. code-block:: + + pip install greykite + +For more installation tips, see `installation `_. + +Documentation +------------- + +Please find our full documentation `here `_. + +Learn More +---------- + +* `Website `_ +* `Paper `_ (KDD '22 Best Paper Runner-up, Applied Data Science Track) +* `Blog post `_ + +Citation +-------- + +Please cite Greykite in your publications if it helps your research: + +.. 
code-block:: + + @misc{reza2021greykite-github, + author = {Reza Hosseini and + Albert Chen and + Kaixu Yang and + Sayan Patra and + Yi Su and + Rachit Arora}, + title = {Greykite: a flexible, intuitive and fast forecasting library}, + url = {https://github.com/linkedin/greykite}, + year = {2021} + } + +.. code-block:: + + @inproceedings{reza2022greykite-kdd, + author = {Hosseini, Reza and Chen, Albert and Yang, Kaixu and Patra, Sayan and Su, Yi and Al Orjany, Saad Eddin and Tang, Sishi and Ahammad, Parvez}, + title = {Greykite: Deploying Flexible Forecasting at Scale at LinkedIn}, + year = {2022}, + isbn = {9781450393850}, + publisher = {Association for Computing Machinery}, + address = {New York, NY, USA}, + url = {https://doi.org/10.1145/3534678.3539165}, + doi = {10.1145/3534678.3539165}, + booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, + pages = {3007–3017}, + numpages = {11}, + keywords = {forecasting, scalability, interpretable machine learning, time series}, + location = {Washington DC, USA}, + series = {KDD '22} + } + + +License +------- + +Copyright (c) LinkedIn Corporation. All rights reserved. Licensed under the `BSD 2-Clause `_ License. 
\ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py index 056d9e0..d0d5917 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -266,10 +266,10 @@ intersphinx_mapping = { 'numpy': ('https://numpy.org/devdocs/', None), - 'pandas': ('https://pandas.pydata.org/pandas-docs/version/0.24.2/', None), + 'pandas': ('https://pandas.pydata.org/pandas-docs/version/1.5.0/', None), 'plotly': ('https://plotly.com/python-api-reference/', None), - 'python': ('http://docs.python.org/3.7/', None), + 'python': ('http://docs.python.org/3.10/', None), 'scipy': ('https://docs.scipy.org/doc/scipy/reference/', None), - 'sklearn': ('https://scikit-learn.org/0.24/', None), - 'stasmodels': ('https://www.statsmodels.org/v0.10.1/', None) + 'sklearn': ('https://scikit-learn.org/1.1/', None), + 'stasmodels': ('https://www.statsmodels.org/v0.13.5/', None) } diff --git a/docs/nbpages/quickstart/0100_simple_forecast.py b/docs/nbpages/quickstart/0100_simple_forecast.py index 1cfd29c..8d465e5 100644 --- a/docs/nbpages/quickstart/0100_simple_forecast.py +++ b/docs/nbpages/quickstart/0100_simple_forecast.py @@ -156,7 +156,11 @@ # Model Diagnostics # ^^^^^^^^^^^^^^^^^ # The component plot shows how your dataset's trend, -# seasonality, and event / holiday patterns are handled in the model: +# seasonality, event / holiday, and other patterns are handled in the model. +# When called with defaults, the function displays three plots: 1) components of the model, +# 2) linear trend and changepoints, and 3) the residuals of the model and +# smoothed estimates of the residuals. By clicking different legend entries, the visibility of +# lines in each plot can be toggled on or off. 
fig = forecast.plot_components() plotly.io.show(fig) # fig.show() if you are using "PROPHET" template diff --git a/docs/nbpages/quickstart/01_exploration/0200_auto_configuration_tools.py b/docs/nbpages/quickstart/01_exploration/0200_auto_configuration_tools.py index 2bda022..a9e534e 100644 --- a/docs/nbpages/quickstart/01_exploration/0200_auto_configuration_tools.py +++ b/docs/nbpages/quickstart/01_exploration/0200_auto_configuration_tools.py @@ -9,6 +9,7 @@ * seasonality inferrer * holiday inferrer + * holiday grouper .. note:: If you use the model templates, you can specify the "auto" option for certain model components @@ -283,3 +284,154 @@ # based on the inferred results, that is consumable directly by the Silverkite model. hi.generate_daily_event_dict() + +# %% +# Holiday Grouper +# --------------- +# +# Going one step further, `~greykite.algo.common.holiday_grouper.HolidayGrouper` +# is a convenient tool that automatically groups similar holidays and their neighboring days +# together based on their estimated impact, using clustering algorithms. +# This helps to (1) reduce the number of parameters to be estimated +# and give each group sufficient data points to be reliably estimated; +# (2) make sure different holidays can be separately modeled to avoid confounding effects. +# +# Also, we provide flexible diagnostics to help users choose the number of groups, as well as +# utility functions to spot-check which group a holiday belongs to and which similar +# holidays are in the same group. +# +# How it works +# ~~~~~~~~~~~~ +# +# First, we need to supply the algorithm with a list of holidays and dates, as well as a time series of interest. +# In addition, we specify a dictionary of neighboring days that a holiday may have an effect on. 
+# For example, for Thanksgiving, which always falls on a Thursday, we may expect a holiday effect +# that starts the day before and lasts until the following Monday, so we can specify +# ``"Thanksgiving": (1, 4)`` as an item in the dictionary. +# All the neighboring days specified as such will be added to the events pool. +# Note that each neighboring day is also treated as a single event, and may not end up in the same group +# as its original holiday date. +# That is, ``"Thanksgiving_plus_4"`` (Monday) may have a very different impact than +# ``"Thanksgiving"`` (Thursday) and they may not end up in the same group. +# +# Second, we also note that holidays falling on weekdays may have a different impact than those on weekends. +# For example, ``"Christmas Day_WE"`` may have a different effect than ``"Christmas Day_WD"``. +# We include two built-in options ("wd_we": weekday vs weekend, "dow_grouped": weekday, Sat, Sun), but one +# can customize the grouping via the ``get_suffix_func`` parameter. +# +# Next, each single event gets a score, the estimated (relative) impact that uses the same methodology +# as in the Holiday Inferrer (e.g. -0.1 means 10% lower than the baseline). +# For example, you can use ``baseline_offsets=[-7, 7]``. +# The score will then be used for the clustering algorithm. Therefore, if an event only shows up once +# in the input time series, the estimated impact may not be accurate. +# One can set the minimum number of occurrences of an event with the ``min_n_days`` parameter (set it to 1 if +# you are okay with including all events that appear only once on a single day in the input data). +# Also, you can specify the minimum absolute average score for an event to be kept in consideration with ``min_abs_avg_score``. +# If an event has an average score of -1% (across all its occurrences), it may not be worth including in the model. +# Absolute effects lower than ``min_abs_avg_score`` will be excluded before clustering. 
+# Also, if an event has inconsistent scores (e.g. two occurrences have -8%, +5% respectively), then this could be +# noise rather than signal. These events are excluded as well. +# This is handled automatically and the user does not need to worry about it. +# +# The last step of the grouper is to group events that have similar effects and generate ``daily_event_df_dict``. +# We provide two options for clustering, Kernel Density Estimation (``clustering_method="kde"``) +# and K-means (``clustering_method="kmeans"``). +# In K-means, you can set ``n_clusters`` to your desired number of groups. +# In KDE clustering, you can change the default bandwidth parameter to adjust the number of groups you get. +# Depending on the length of the time series and the number of holidays considered, we recommend a range from 5 to +# 15 groups. You can check the visualization / diagnostics via attribute ``self.result_dict["kmeans_plot"]`` +# or ``self.result_dict["kde_plot"]``, respectively. +# See `~greykite.algo.common.holiday_grouper.HolidayGrouper.group_holidays` for more parameter details. +# +# Example +# ~~~~~~~ +# +# Now we look at an example with the Peyton-Manning Wiki page view data. + +import pandas as pd +import plotly +from greykite.algo.common.holiday_grouper import HolidayGrouper +from greykite.common.data_loader import DataLoader +from greykite.common.features.timeseries_features import get_holidays +from greykite.common import constants as cst + +df = DataLoader().load_peyton_manning() +df[cst.TIME_COL] = pd.to_datetime(df[cst.TIME_COL]) + +# %% +# Let's generate a list of holidays in the United States, and we +# also specify the neighboring days we want to consider in the holiday model. + +year_start = df[cst.TIME_COL].dt.year.min() - 1 +year_end = df[cst.TIME_COL].dt.year.max() + 1 +holiday_df = get_holidays(countries=["US"], year_start=year_start, year_end=year_end)["US"] + +# Defines the number of pre / post days that a holiday has impact on.
+# If not specified, (0, 0) will be used. +holiday_impact_dict = { + "Christmas Day": (4, 3), # 12/25. + "Independence Day": (4, 4), # 7/4. + "Juneteenth National Independence Day": (3, 3), # 6/19. + "Labor Day": (3, 1), # Monday. + "Martin Luther King Jr. Day": (3, 1), # Monday. + "Memorial Day": (3, 1), # Monday. + "New Year's Day": (3, 4), # 1/1. + "Thanksgiving": (1, 4), # Thursday. +} + +# %% +# Now we run the holiday grouper with K-means clustering. + +# Instantiates `HolidayGrouper`. +hg = HolidayGrouper( + df=df, + time_col=cst.TIME_COL, + value_col=cst.VALUE_COL, + holiday_df=holiday_df, + holiday_date_col="date", + holiday_name_col="event_name", + holiday_impact_dict=holiday_impact_dict, + get_suffix_func="dow_grouped" +) + +# Runs holiday grouper using k-means with diagnostics. +hg.group_holidays( + baseline_offsets=[-7, 7], + min_n_days=2, + min_abs_avg_score=0.03, + clustering_method="kmeans", + n_clusters=6, + include_diagnostics=True +) + +result_dict = hg.result_dict +daily_event_df_dict = result_dict["daily_event_df_dict"] # Can be directly used in events. + +# %% +# Check results. For example, we can check the score and grouping of New Year's Day that falls on weekdays. + +hg.check_scores("New Year's Day_WD") +hg.check_holiday_group("New Year's Day_WD") + +# %% +# Check the diagnostics plot for K-means clustering. + +plotly.io.show(result_dict["kmeans_plot"]) + +# %% +# Now let's try clustering using KDE and check the results. + +hg.group_holidays( + baseline_offsets=[-7, 7], + min_n_days=1, + min_abs_avg_score=0.03, + bandwidth_multiplier=0.5, + clustering_method="kde" +) +result_dict = hg.result_dict +daily_event_df_dict = result_dict["daily_event_df_dict"] + +plotly.io.show(result_dict["kde_plot"]) +# Checks the number of events in each group.
+for event_group, event_df in daily_event_df_dict.items(): + print(f"{event_group}: contains {event_df.shape[0]} days.") diff --git a/docs/nbpages/quickstart/02_interpretability/0200_interpretability.py b/docs/nbpages/quickstart/02_interpretability/0200_interpretability.py index 43af6f3..f5c1b93 100644 --- a/docs/nbpages/quickstart/02_interpretability/0200_interpretability.py +++ b/docs/nbpages/quickstart/02_interpretability/0200_interpretability.py @@ -15,12 +15,17 @@ These breakdowns then can be used to answer questions such as: -- Question 1: How is the forecast value is generated? +- Question 1: How is the forecast value generated? - Question 2: What is driving the change of the forecast as new data comes in? Forecast components can also help us analyze model behavior and sensitivity. This is because while it is not feasible to compare a large set of features across two model settings, it can be quite practical and informative to compare a few well-defined components. + +This tutorial discusses in detail the usage of ``forecast_breakdown`` and how to estimate forecast +components using custom component dictionaries. Some of this functionality has been built in estimators +using the method ``plot_components(...)``. An example of this usage is in the "Simple Forecast" tutorial +in the Quick Start. """ # required imports @@ -82,7 +87,7 @@ def prepare_bikesharing_data(): # This is needed because we are using regressors, # and future regressor data must be augmented to ``df``. # We mimic that by removal of the values of the response. 
- train_df.at[(len(train_df) - forecast_horizon):len(train_df), value_col] = None + train_df.loc[(len(train_df) - forecast_horizon):len(train_df), value_col] = None print(f"train_df shape: \n {train_df.shape}") print(f"test_df shape: \n {test_df.shape}") @@ -119,7 +124,7 @@ def fit_forecast( "orders_list": [[7, 7*2, 7*3]], "interval_list": [(1, 7), (8, 7*2)]}, "series_na_fill_func": lambda s: s.bfill().ffill()}, - "fast_simulation": True + "fast_simulation": True } # Changepoints configuration @@ -253,6 +258,9 @@ def fit_forecast( # If some variables do not satisfy any of the groupings, they will be grouped into "OTHER". # The following breakdown dictionary should work for many use cases. # However, the users can customize it as needed. +# +# Two alternative dictionaries are included in `~greykite.common.constants` in the variables +# ``DEFAULT_COMPONENTS_REGEX_DICT`` and ``DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT``. grouping_regex_patterns_dict = { "regressors": "regressor_.*", # regressor effects diff --git a/docs/nbpages/quickstart/03_benchmark/0200_benchmark.py b/docs/nbpages/quickstart/03_benchmark/0200_benchmark.py index 7316844..3909d2c 100644 --- a/docs/nbpages/quickstart/03_benchmark/0200_benchmark.py +++ b/docs/nbpages/quickstart/03_benchmark/0200_benchmark.py @@ -93,22 +93,23 @@ # %% # Now we update ``common_config`` to specify the individual models. -# Defines ``Prophet`` model template with custom seasonality -model_components = ModelComponentsParam( - seasonality={ - "seasonality_mode": ["additive"], - "yearly_seasonality": ["auto"], - "weekly_seasonality": [True], - }, - growth={ - "growth_term": ["linear"] - } -) -param_update = dict( - model_template=ModelTemplateEnum.PROPHET.name, - model_components_param=model_components -) -Prophet = replace(common_config, **param_update) +# # The following code defines a ``Prophet`` configuration. 
+# # Defines ``Prophet`` model template with custom seasonality +# model_components = ModelComponentsParam( +# seasonality={ +# "seasonality_mode": ["additive"], +# "yearly_seasonality": ["auto"], +# "weekly_seasonality": [True], +# }, +# growth={ +# "growth_term": ["linear"] +# } +# ) +# param_update = dict( +# model_template=ModelTemplateEnum.PROPHET.name, +# model_components_param=model_components +# ) +# Prophet = replace(common_config, **param_update) # Defines ``Silverkite`` model template with automatic autoregression # and changepoint detection @@ -138,7 +139,7 @@ # Define the list of configs to benchmark # The dictionary keys will be used to store the benchmark results configs = { - "Prophet": Prophet, + # "Prophet": Prophet, "SK_1": Silverkite_1, "SK_2": Silverkite_2, } @@ -402,13 +403,13 @@ # Error due to incompatible model components in config # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -# regressor_cols is not part of Prophet's model components +# some_regressor is not part of Prophet's model components model_components=ModelComponentsParam( regressors={ - "regressor_cols": ["regressor1", "regressor2", "regressor_categ"] + "some_regressor": ["regressor1", "regressor2", "regressor_categ"] } ) -invalid_prophet = replace(Prophet, model_components_param=model_components) +invalid_prophet = replace(Silverkite_1, model_components_param=model_components) invalid_configs = {"invalid_prophet": invalid_prophet} bm = BenchmarkForecastConfig(df=df, configs=invalid_configs, tscv=tscv) try: @@ -421,7 +422,7 @@ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # model template name is not part of TemplateEnum, thus invalid -unknown_template = replace(Prophet, model_template="SOME_TEMPLATE") +unknown_template = replace(Silverkite_1, model_template="SOME_TEMPLATE") invalid_configs = {"unknown_template": unknown_template} bm = BenchmarkForecastConfig(df=df, configs=invalid_configs, tscv=tscv) try: @@ -435,10 +436,10 @@ # the configs are valid by themselves, however 
incompatible for # benchmarking as these have different forecast horizons -Prophet_forecast_horizon_30 = replace(Prophet, forecast_horizon=30) +Silverkite_forecast_horizon_30 = replace(Silverkite_1, forecast_horizon=30) invalid_configs = { - "Prophet": Prophet, - "Prophet_30": Prophet_forecast_horizon_30 + "Silverkite": Silverkite_1, + "Silverkite_30": Silverkite_forecast_horizon_30 } bm = BenchmarkForecastConfig(df=df, configs=invalid_configs, tscv=tscv) try: @@ -450,7 +451,7 @@ # Error due to different forecast horizons in config and tscv # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -## Error due to different forecast horizons in config and tscv +# Error due to different forecast horizons in config and tscv tscv = RollingTimeSeriesSplit(forecast_horizon=15) bm = BenchmarkForecastConfig(df=df, configs=configs, tscv=tscv) try: diff --git a/docs/pages/autodoc/doc.rst b/docs/pages/autodoc/doc.rst index d5f48f4..f3249cb 100644 --- a/docs/pages/autodoc/doc.rst +++ b/docs/pages/autodoc/doc.rst @@ -95,12 +95,14 @@ EasyConfig .. autoclass:: greykite.algo.common.holiday_inferrer.HolidayInferrer +.. autoclass:: greykite.algo.common.holiday_grouper.HolidayGrouper + Changepoint Detection --------------------- .. autoclass:: greykite.algo.changepoint.adalasso.changepoint_detector.ChangepointDetector Benchmarking ----------------- +------------ .. autoclass:: greykite.framework.benchmark.benchmark_class.BenchmarkForecastConfig :members: @@ -130,19 +132,13 @@ Hierarchical Forecast Utility Functions ----------------- -.. currentmodule:: greykite.common.features.timeseries_features -.. autofunction:: get_available_holiday_lookup_countries -.. autofunction:: get_available_holidays_across_countries -.. autofunction:: build_time_features_df -.. autofunction:: get_holidays -.. autofunction:: add_event_window_multi -.. autofunction:: add_daily_events -.. autofunction:: convert_date_to_continuous_time +.. automodule:: greykite.common.features.timeseries_features .. 
currentmodule:: greykite.algo.forecast.silverkite.forecast_simple_silverkite_helper .. autofunction:: get_event_pred_cols .. autofunction:: greykite.framework.pipeline.utils.get_basic_pipeline +.. autofunction:: greykite.framework.utils.exploratory_data_analysis.get_exploratory_plots .. autofunction:: greykite.framework.utils.result_summary.summarize_grid_search_results .. autofunction:: greykite.framework.utils.result_summary.get_ranks_and_splits @@ -160,6 +156,7 @@ Utility Functions .. currentmodule:: greykite.common.evaluation .. autofunction:: r2_null_model_score +.. autofunction:: mean_interval_score .. currentmodule:: greykite.framework.pipeline.utils .. autofunction:: get_score_func_with_aggregation @@ -191,7 +188,7 @@ Internal Functions .. autofunction:: greykite.algo.uncertainty.conditional.conf_interval.conf_interval .. autofunction:: greykite.algo.changepoint.adalasso.changepoints_utils.combine_detected_and_custom_trend_changepoints -.. currentmodule:::: greykite.common.features.timeseries_lags +.. currentmodule:: greykite.common.features.timeseries_lags .. autofunction:: build_autoreg_df .. autofunction:: build_agg_lag_df .. autofunction:: build_autoreg_df_multi diff --git a/docs/pages/changelog/changelog.rst b/docs/pages/changelog/changelog.rst index 374794b..0318119 100644 --- a/docs/pages/changelog/changelog.rst +++ b/docs/pages/changelog/changelog.rst @@ -1,5 +1,27 @@ -Changelog -========= +0.5.0 (2023-04-03) +------------------ + +Python 3.10 support. + +* New features and methods + * Improvements on modeling holidays + * @Yi Su: Added `~greykite.algo.common.holiday_grouper.HolidayGrouper`. Holidays with a similar effect are grouped together to have fewer and more robust coefficients. With this feature, holidays are better modeled with improved forecast accuracy. See instructions in :doc:`/gallery/quickstart/01_exploration/0200_auto_configuration_tools`. 
+ * @Kaixu Yang: Added support for holiday neighboring impact for data with frequency longer than daily through the ``daily_event_neighbor_impact`` parameter (e.g. this enables modeling holidays on weekly data where the event dates may not fall on the exact timestamps); added holiday neighboring events (i.e. the lags of an actual holiday can be specified in the model) through the ``daily_event_shifted_effect`` parameter. See details at :doc:`/pages/model_components/0400_events`. + * @Yi Su: Added holiday indicators. Now users can specify "is_event_exact", "is_event_adjacent", "is_event" (a union of both) as ``extra_pred_cols`` in the model. See details at :doc:`/pages/model_components/0400_events`. + * @Reza Hosseini: Added DST indicators. Now users can specify "us_dst" or "eu_dst" in ``extra_pred_cols``. You may also use ``get_us_dst_start/end``, ``get_eu_dst_start/end`` functions in `~greykite.common.features.timeseries_features` to get the dates. + * @Yi Su: Theoretical improvements for the volatility model in linear and ridge algorithm for more accurate variance estimate and prediction intervals. + * @Phil Gaudreau: Added new evaluation metric: `~greykite.common.evaluation.mean_interval_score`. + * @Brian Vegetabile: Enhanced components plot that consolidates previous forecast breakdown functionality. The redesign provides a cleaner visual and allows for flexible breakdowns via regular expressions. See examples in :doc:`/gallery/quickstart/0100_simple_forecast`. + +* Library enhancements + * @Kaixu Yang: Python 3.10 support. Deprecated support for lower Python versions. + * @Sayan Patra: New utility function: `~greykite.framework.utils.exploratory_data_analysis.get_exploratory_plots` to easily generate exploratory data analysis (EDA) plots in HTML. + * @Kaixu Yang: Added ``optimize_mape`` option to quantile regression. It uses 1 over y as weights in the loss function. See `~greykite.algo.common.l1_quantile_regression.QuantileRegression` for details. 
+ +* Bug fixes + * @Qiang Fei: In case of simulation, now ``min_admissible_value`` and ``max_admissible_value`` will correctly cap the simulated values. Additionally, errors are propagated through simulation steps to make the intervals more accurate. + * @Yi Su, @Sayan Patra: Now ``train_end_date`` is always respected if specified by the user. Previously it was ignored if there were trailing NA’s in training data or ``anomaly_df`` imputed the anomalous points to NA. Also, now ``train_end_date`` accepts a string value. + * @Yi Su: The seasonality order now takes `None` without raising an error. It will be treated the same as `False` or zero. 0.4.0 (2022-07-15) ------------------ @@ -33,7 +55,6 @@ Changelog * @Albert Chen @Kaixu Yang @Yi Su: Speed optimization for Silverkite. * @Albert Chen @Reza Hosseini @Kaixu Yang @Sayan Patra @Yi Su: Other library enhancements and bug fixes. - 0.3.0 (2021-12-14) ------------------ diff --git a/docs/pages/model_components/0400_events.rst b/docs/pages/model_components/0400_events.rst index 4d83741..f9f4bf7 100644 --- a/docs/pages/model_components/0400_events.rst +++ b/docs/pages/model_components/0400_events.rst @@ -26,8 +26,8 @@ If you are not sure which holidays to use, start with our defaults and create a Plot forecasts against actuals, and look for large errors. If these happen on holidays, include relevant countries in ``holiday_lookup_countries`` list. -Silverkite -^^^^^^^^^^ +Silverkite Holidays +^^^^^^^^^^^^^^^^^^^ Options (defaults shown for ``SILVERKITE`` template): @@ -274,9 +274,44 @@ To customize this, you will want to see the available holidays. While holidays are specified at a daily level, you can use interactions with seasonality to capture sub-daily holiday effects. For more information, see :doc:`/pages/model_components/0600_custom`. +Holiday Indicators and Neighboring Effect +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Auto holiday -~~~~~~~~~~~~ +1.
When holidays are present in the model, we allow for using holiday indicators: + +* "is_event": an indicator column which is 1 when the timestamp is either the exact holiday date or one of its + adjacent days. + +* "is_event_exact": an indicator of whether the timestamp is exactly on the holiday date. + +* "is_event_adjacent": an indicator of whether the timestamp is adjacent to a holiday if ``holiday_pre_num_days`` + or ``holiday_post_num_days`` is not 0. + +You may include the interactions between such indicators and other features in ``extra_pred_cols`` like +``extra_pred_cols = ["is_event:y_lag1"]``. +See more at :doc:`/pages/model_components/0600_custom`. + +Or you may use it as a conditional column in the uncertainty model ``conditional_cols = ["dow", "is_event"]``. + +2. Sometimes you may have a weekly time series or the response is a daily rolling sum. +In such cases, the whole week, or the whole rolling window, is impacted by a holiday within it. +We allow for modeling such holiday neighboring effect by specifying ``daily_event_neighbor_impact`` in ``events``. +For example, you may use ``daily_event_neighbor_impact = 6`` to model rolling 7-day sum holiday effect +in a daily time series. Or you may use +``daily_event_neighbor_impact = lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i) for i in range(7)]`` +to model a holiday effect in weekly time series. + +Note that this feature works by adding extra dates with the same event name to the holiday model; +therefore the number of events does not increase. + +3. There are also cases where you need additional events that are shifted based on existing events. +For example, if we model the week-over-week changes as response, the week after a holiday has a counter effect. +We support an easy way of adding such events using the ``daily_event_shifted_effect`` parameter in ``events``.
+For example, if we have an event called "Christmas Day", ``daily_event_shifted_effect=["7D"]`` will add a new +event called "Christmas Day_7D_after" which is 7 days after the Christmas Day. + +Auto Holiday and Holiday Grouper +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Silverkite models support automatically inferring significant holidays and generate holiday configurations. It utilizes the `~greykite.algo.common.holiday_inferrer.HolidayInferrer` method to infer holidays. @@ -296,6 +331,20 @@ added to any inferred significant holiday and neighboring day events. holiday_lookup_countries=["US"] ) +We also provide a Holiday Grouper tool to help you group holidays based on their estimated impact inferred from +the training data. The smart grouping makes sure to not create too many parameters to each holiday while making +sure that holidays that are different enough will be modeled separately. +For more details, see Holiday Grouper in :doc:`/gallery/quickstart/01_exploration/0200_auto_configuration_tools`. + +The Holiday Grouper returns a curated ``daily_event_df_dict`` which can be directly specified in events. + +.. code-block:: python + + events=dict( + holiday_lookup_countries=[], + daily_event_df_dict=daily_event_df_dict + ) + Prophet ^^^^^^^ diff --git a/docs/pages/stepbystep/0300_input.rst b/docs/pages/stepbystep/0300_input.rst index 015d107..58c6944 100644 --- a/docs/pages/stepbystep/0300_input.rst +++ b/docs/pages/stepbystep/0300_input.rst @@ -132,7 +132,7 @@ Make sure your data loads correctly. First, check the printed logs of ``load_dat time_col="ts", value_col="y", freq="D") # optional, but recommended if you have missing data points - # W for weekly, D for daily, H for hourly, etc. See ``pd.date_range`` + # W for weekly, D for daily, H for hourly, etc. See `pd.date_range`. Here is some example logging info for hourly data. The loaded data spans 2017-10-11 to 2020-02-23. @@ -186,7 +186,7 @@ the time series is included in the result. 
coverage=0.95 ) ) - ts = result.timeseries # a `UnivariateTimeSeries` + ts = result.timeseries # a `UnivariateTimeSeries` instance You can also check the information programatically: @@ -208,11 +208,8 @@ The best way to check your data is to plot it. You can do this interactively in .. code-block:: python - from plotly.offline import init_notebook_mode, iplot - init_notebook_mode(connected=True) # for generating offline graphs within Jupyter Notebook - fig = ts.plot() - iplot(fig) + fig.show() # Or simply `ts.plot()` Anomalies @@ -244,9 +241,6 @@ For example: import numpy as np import pandas as pd - from plotly.offline import init_notebook_mode, iplot - init_notebook_mode(connected=True) # for generating offline graphs within Jupyter Notebook - import greykite.common.constants as cst from greykite.framework.input.univariate_time_series import UnivariateTimeSeries @@ -269,7 +263,7 @@ For example: # "regressor3": [None, None, 2.14, 2.16, 2.17] # }) - # Specify anomalies using ``anomaly_df``. + # Specify anomalies using `anomaly_df`. # Each row corresponds to an anomaly. The start date, end date, # and impact (if known) are provided. Extra columns can be # used to annotate information such as which metrics the @@ -281,10 +275,10 @@ For example: cst.ADJUSTMENT_DELTA_COL: [-27, np.nan], cst.METRIC_COL: ["y", "regressor3"] }) - # ``anomaly_info`` dictates which columns - # in ``df`` to correct (``value_col`` below), and which rows - # in ``anomaly_df`` to use to correct them. - # Rows are filtered using ``filter_by_dict``. + # `anomaly_info` dictates which columns + # in `df` to correct (`value_col` below), and which rows + # in `anomaly_df` to use to correct them. + # Rows are filtered using `filter_by_dict`. anomaly_info = [ { "value_col": "y", @@ -300,8 +294,8 @@ For example: }, ] - # Pass ``anomaly_info`` to ``load_data``. - # Since our dataset has regressors, we pass ``regressor_cols`` as well. + # Passes `anomaly_info` to `load_data`. 
+ # Since our dataset has regressors, we pass `regressor_cols` as well. ts = UnivariateTimeSeries() ts.load_data( df=df, @@ -313,13 +307,13 @@ For example: # Plots the dataset after correction fig = ts.plot() - iplot(fig) + fig.show() # Set show_anomaly_adjustment=True to show the dataset before correction fig = ts.plot(show_anomaly_adjustment=True) - iplot(fig) + fig.show() # The results are stored as attributes. - ts.df # dataset after correction (same as ``df_adjusted`` above) - ts.df_before_adjustment # dataset before correction (same as ``df`` above) + ts.df # dataset after correction (same as `df_adjusted` above) + ts.df_before_adjustment # dataset before correction (same as `df` above) Check trend ~~~~~~~~~~~ @@ -342,7 +336,7 @@ For example, look at weekly averages. # (7*24 for weekly aggregation of hourly data) groupby_custom_column=None, title=f"Weekly average of {value_col}") - iplot(fig) + fig.show() For a more detailed examination, including automatic changepoint detection, see :doc:`/gallery/quickstart/01_exploration/0100_changepoint_detection`. @@ -364,7 +358,7 @@ To check daily seasonality, aggregate by hour of day and plot the average: groupby_sliding_window_size=None, groupby_custom_column=None, title=f"daily seasonality: mean of {value_col}") - iplot(fig) + fig.show() To check weekly seasonality, group by day of week. @@ -378,7 +372,7 @@ To check weekly seasonality, group by day of week. groupby_sliding_window_size=None, groupby_custom_column=None, title=f"weekly seasonality: mean of {value_col}") - iplot(fig) + fig.show() To check yearly seasonality, group by week of year. @@ -392,7 +386,7 @@ To check yearly seasonality, group by week of year. groupby_sliding_window_size=None, groupby_custom_column=None, title=f"yearly seasonality: mean of {value_col}") - iplot(fig) + fig.show() To see other features to group by: see :py:func:`~greykite.common.features.timeseries_features.build_time_features_df`. 
diff --git a/docs/pages/stepbystep/0500_output.rst b/docs/pages/stepbystep/0500_output.rst index fd92276..a66b10c 100644 --- a/docs/pages/stepbystep/0500_output.rst +++ b/docs/pages/stepbystep/0500_output.rst @@ -117,12 +117,9 @@ You can plot the results: .. code-block:: python - from plotly.offline import init_notebook_mode, iplot - init_notebook_mode(connected=True) # for generating offline graphs within Jupyter Notebook - backtest = result.backtest fig = backtest.plot() - iplot(fig) + fig.show() Show the evaluation metrics: @@ -133,16 +130,13 @@ Show the evaluation metrics: print(backtest.test_evaluation) # hold out test set -See the component plot to understand how trend, seasonality, -and holidays are handled by the forecast: +See the component plot to understand how factors like trend, seasonality, +and holidays are handled by the forecast and to visualize change points and residuals: .. code-block:: python - from plotly.offline import init_notebook_mode, iplot - init_notebook_mode(connected=True) - fig = backtest.plot_components() - iplot(fig) # fig.show() if you are using "PROPHET" template + fig.show() Access backtest forecasted values and prediction intervals: @@ -159,11 +153,8 @@ and their descriptions. .. code-block:: python - from plotly.offline import init_notebook_mode, iplot from greykite.common.evaluation import EvaluationMetricEnum - init_notebook_mode(connected=True) # for generating offline graphs within Jupyter Notebook - # MAPE by day of week fig = backtest.plot_grouping_evaluation( score_func=EvaluationMetricEnum.MeanAbsolutePercentError.get_metric_func(), @@ -172,7 +163,7 @@ and their descriptions. groupby_time_feature="dow", # day of week groupby_sliding_window_size=None, groupby_custom_column=None) - iplot(fig) + fig.show() # RMSE over time fig = backtest.plot_grouping_evaluation( @@ -182,15 +173,17 @@ and their descriptions.
groupby_time_feature=None, groupby_sliding_window_size=7, # weekly aggregation of daily data groupby_custom_column=None) - iplot(fig) + fig.show() See `~greykite.framework.output.univariate_forecast.UnivariateForecast.plot_flexible_grouping_evaluation` for a more powerful plotting function to plot the quantiles of the error along with the mean. -You can use component plots for a concise visual representation of how the dataset's trend, seasonality -and holiday patterns are estimated by the forecast model. Currently, ``Silverkite`` calculates component -plots based on dataset passed to the ``fit`` method, whereas ``Prophet`` calculates component plots -based on dataset passed to the ``predict`` method. +You can use component plots for a concise visual representation of how factors like the dataset's trend, seasonality +and holiday patterns are estimated by the forecast model and to visualize other model information like residuals and +change points. Currently, by default, ``Silverkite`` calculates component plots based on a dataset passed to the ``fit`` +method, whereas ``Prophet`` calculates component plots based on a dataset passed to the ``predict`` method. See the +discussion below on model forecasts for more information on how to access component plots based on the dataset +passed to the ``predict`` method in ``Silverkite``. .. code-block:: python @@ -209,7 +202,7 @@ if it looks reasonable. forecast = result.forecast fig = forecast.plot() - iplot(fig) + fig.show() Show the error metrics on the training set. @@ -228,8 +221,9 @@ Access future forecasted values and prediction intervals: Just as for backtest, you can use ``forecast.plot_grouping_evaluation()`` to examine the training error by various dimensions (e.g. over time, by day of week), and ``forecast.plot_components()`` to check the trend, -seasonality and holiday effects. See -`~greykite.framework.output.univariate_forecast.UnivariateForecast` for details. 
+seasonality and holiday effects and visualize change points and residuals. Note that +components from the data set used in the ``predict`` call can be accessed by ``.plot_components(predict_phase=True)``. +See `~greykite.framework.output.univariate_forecast.UnivariateForecast` for details. Model diff --git a/greykite/algo/common/holiday_grouper.py b/greykite/algo/common/holiday_grouper.py new file mode 100644 index 0000000..4a1ef28 --- /dev/null +++ b/greykite/algo/common/holiday_grouper.py @@ -0,0 +1,866 @@ +# BSD 2-CLAUSE LICENSE + +# Redistribution and use in source and binary forms, with or without modification, +# are permitted provided that the following conditions are met: + +# Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR +# #ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+# original author: Yi Su +"""Automatically scores and groups holidays of similar effects.""" + +from datetime import timedelta +from typing import Any +from typing import Callable +from typing import Dict +from typing import List +from typing import Optional +from typing import Tuple +from typing import Union + +import numpy as np +import pandas as pd +from IPython.display import display +from sklearn.cluster import KMeans +from sklearn.metrics import silhouette_score +from sklearn.neighbors import KernelDensity + +from greykite.algo.common.holiday_inferrer import HolidayInferrer +from greykite.algo.common.holiday_utils import HOLIDAY_DATE_COL +from greykite.algo.common.holiday_utils import HOLIDAY_NAME_COL +from greykite.algo.common.holiday_utils import get_dow_grouped_suffix +from greykite.algo.common.holiday_utils import get_weekday_weekend_suffix +from greykite.common.constants import EVENT_DF_DATE_COL +from greykite.common.constants import EVENT_DF_LABEL_COL +from greykite.common.logging import LoggingLevelEnum +from greykite.common.logging import log_message +from greykite.common.viz.timeseries_plotting import plot_multivariate + + +class HolidayGrouper: + """This class estimates the impact of holidays and their neighboring days + given a raw holiday dataframe ``holiday_df``, and a time series containing + the observed values to construct the baselines. + It groups events with similar effects into several groups using kernel density estimation (KDE) + or k-means clustering and generates the grouped events in a dictionary of dataframes that is recognizable by + `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast`. + + Parameters + ---------- + df : `pandas.DataFrame` + Input time series that contains ``time_col`` and ``value_col``. + The values will be used to construct baselines to estimate the holiday impact. + time_col : `str` + Name of the time column in ``df``. + value_col : `str` + Name of the value column in ``df``.
+ holiday_df : `pandas.DataFrame` + Input holiday dataframe that contains the dates and names of the holidays. + holiday_date_col : `str` + Name of the holiday date column in ``holiday_df``. + holiday_name_col : `str` + Name of the holiday name column in ``holiday_df``. + holiday_impact_dict : `Dict` [`str`, Any] or None, default None + A dictionary containing the neighboring impacting days of a certain holiday. + The key is the name of the holiday matching those in the provided ``holiday_df``. + The value is a tuple of two values indicating the number of neighboring days + before and after the holiday. For example, a valid dictionary may look like: + + .. code-block:: python + + holiday_impact_dict = { + "Christmas Day": [3, 3], + "Memorial Day": [0, 0] + } + + get_suffix_func : Callable or `str` or None, default "wd_we" + A function that generates a suffix (usually a time feature e.g. "_WD" for weekday, + "_WE" for weekend) given an input date. + This can be used to estimate the interaction between floating holidays + and on which day they are getting observed. + We currently support two defaults: + + - "wd_we" to generate suffixes based on whether the day falls on weekday or weekend. + - "dow_grouped" to generate three categories: ["_WD", "_Sat", "_Sun"]. + + If None, no suffix is added. + + Attributes + ---------- + expanded_holiday_df : `pandas.DataFrame` + An expansion of ``holiday_df`` after adding the neighboring dates provided in + ``holiday_impact_dict`` and the suffix generated by ``get_suffix_func``. + For example, if ``"Christmas Day": [3, 3]`` and "wd_we" are used, events + such as "Christmas Day_WD_plus_1_WE" or "Christmas Day_WD_minus_3_WD" will be generated + for a Christmas that falls on Friday. + baseline_offsets : `Tuple`[`int`] or None + The offsets in days to calculate baselines for a given holiday. + By default, the same days of the week before and after are used. 
+ use_relative_score : `bool` or None + Whether to use relative or absolute score when estimating the holiday impact. + clustering_method : `str` or None + Clustering method used to group the holidays. + Since we are doing 1-D clustering, currently supported methods include + (1) "kde" for kernel density estimation, and (2) "kmeans" for k-means clustering. + bandwidth : `float` or None + The bandwidth used in the kernel density estimation. + A higher bandwidth results in fewer clusters. + If None, it is automatically inferred with the ``bandwidth_multiplier`` factor. + bandwidth_multiplier : `float` or None + Multiplier applied to the kernel density estimation's default bandwidth calculated from + `here_`. + This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. + Only used when ``bandwidth`` is not specified. + kde : `KernelDensity` or None + The `KernelDensity` object if ``clustering_method == "kde"``. + n_clusters : `int` or None + Number of clusters in the k-means algorithm. + kmeans : `KMeans` or None + The `KMeans` object if ``clustering_method == "kmeans"``. + include_diagnostics : `bool` or None + Whether to include ``kmeans_diagnostics`` and ``kmeans_plot`` in the output ``result_dict``. + result_dict : `Dict`[`str`, Any] or None + A dictionary that stores the scores and clustering results, with the following keys. + + - "holiday_inferrer": the `~greykite.algo.common.holiday_inferrer.HolidayInferrer` + instance used for calculating the scores. + - "score_result_original": a dictionary with keys being the names of all holiday events + after expansion (i.e. the keys in ``expanded_holiday_df``), values being a list of scores + of all dates corresponding to this event. + - "score_result_avg_original": a dictionary with the same key as in + ``result_dict["score_result_original"]``. + But the values are the average scores of each event across all occurrences.
+ - "score_result": same as ``result_dict["score_result_original"]``, but after removing + holidays with inconsistent / negligible scores. + - "score_result_avg": same as ``result_dict["score_result_avg_original"]``, but after removing + holidays with inconsistent / negligible scores. + - "daily_event_df_dict_with_score": a dictionary of dataframes. + Key is the group name ``"holiday_group_{k}"``. + Value is a dataframe of all holiday events in this group, containing 4 columns: + "date" (``EVENT_DF_DATE_COL``), "event_name" (``EVENT_DF_LABEL_COL``), "original_name", "avg_score". + - "daily_event_df_dict": a dictionary of dataframes that is ready to use in `SilverkiteForecast`. + Each dataframe contains 2 columns: ``EVENT_DF_DATE_COL`` and ``EVENT_DF_LABEL_COL``. + - "kde_cutoffs": a list of `float`, the cutoffs returned by the kernel density clustering. + - "kde_res": a dataframe that contains "score" and "density" from the kernel density estimation. + - "kde_plot": a plot of the kernel density estimation. + - "kmeans_diagnostics": a dataframe containing metrics for different numbers of clusters. + Columns are: + + - "k": number of clusters; + - "wsse": within-cluster sum of squared error (lower is better); + - "sil_score": Silhouette coefficient, a value in [-1, 1] that describes + the separation of clusters (higher is better). + + Only generated when ``include_diagnostics`` is True. See `group_holidays` for details. + - "kmeans_plot": a plot visualizing how the diagnostic metrics change over K. + Only generated when ``include_diagnostics`` is True. See `group_holidays` for details.
+ """ + def __init__( + self, + df: pd.DataFrame, + time_col: str, + value_col: str, + holiday_df: pd.DataFrame, + holiday_date_col: str, + holiday_name_col: str, + holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, + get_suffix_func: Optional[Union[Callable, str]] = "wd_we"): + self.df = df.copy() + self.time_col = time_col + self.value_col = value_col + self.holiday_df = holiday_df.copy() + self.holiday_date_col = holiday_date_col + self.holiday_name_col = holiday_name_col + if holiday_impact_dict is None: + holiday_impact_dict = {} + self.holiday_impact_dict = holiday_impact_dict.copy() + self.get_suffix_func = get_suffix_func + + # Derived attributes. + # Casts time columns to `datetime`. + self.df[time_col] = pd.to_datetime(self.df[time_col]) + self.holiday_df[holiday_date_col] = pd.to_datetime(self.holiday_df[holiday_date_col]) + # Creates `HOLIDAY_DATE_COL` and `HOLIDAY_NAME_COL` (if not exists) to be recognized by `HolidayInferrer`. + self.holiday_df[HOLIDAY_DATE_COL] = self.holiday_df[self.holiday_date_col] + self.holiday_df[HOLIDAY_NAME_COL] = self.holiday_df[self.holiday_name_col] + + # Other attributes that are not needed for initialization. + self.baseline_offsets: Optional[Tuple[int, int]] = None + self.use_relative_score: Optional[bool] = None + self.clustering_method: Optional[str] = None + self.bandwidth: Optional[float] = None + self.bandwidth_multiplier: Optional[float] = None + self.kde: Optional[KernelDensity] = None + self.n_clusters: Optional[int] = None + self.kmeans: Optional[KMeans] = None + self.include_diagnostics: Optional[bool] = None + # Result dictionary will be populated after the scoring and grouping functions are run. + self.result_dict: Optional[Dict[str, Any]] = None + + # Expands `holiday_df` to include neighboring days + # and suffixes (e.g. "_WD" for weekdays, "_Sat" for Saturdays). 
+ self.expanded_holiday_df = self.expand_holiday_df_with_suffix( + holiday_df=self.holiday_df, + holiday_date_col=HOLIDAY_DATE_COL, + holiday_name_col=HOLIDAY_NAME_COL, + holiday_impact_dict=self.holiday_impact_dict, + get_suffix_func=self.get_suffix_func + ) + + def group_holidays( + self, + baseline_offsets: Tuple[int, int] = (-7, 7), + use_relative_score: bool = True, + min_n_days: int = 1, + min_same_sign_ratio: float = 0.66, + min_abs_avg_score: float = 0.05, + clustering_method: str = "kde", + bandwidth: Optional[float] = None, + bandwidth_multiplier: Optional[float] = 0.2, + n_clusters: Optional[int] = 5, + include_diagnostics: bool = False) -> None: + """Estimates the impact of holidays and their neighboring days and + groups events with similar effects to several groups using kernel density estimation (KDE). + Then generates the grouped events and stores the results in ``self.result_dict``. + + Parameters + ---------- + baseline_offsets : `Tuple`[`int`], default (-7, 7) + The offsets in days to calculate baselines for a given holiday. + By default, the same days of the week before and after are used. + use_relative_score : `bool`, default True + Whether to use relative or absolute score when estimating the holiday impact. + min_n_days : `int`, default 1 + Minimal number of occurrences for a holiday event to be kept before grouping. + min_same_sign_ratio : `float`, default 0.66 + Threshold of the ratio of the same-sign scores for an event's occurrences. + For example, if an event has two occurrences, they both need to have positive or negative + scores for the ratio to achieve 0.66. + Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. + This parameter is intended to rule out holidays that have indefinite effects. + min_abs_avg_score : `float`, default 0.05 + The minimal average score of an event (across all its occurrences) to be kept + before grouping. 
+ When ``use_relative_score = True``, 0.05 means the effect must be greater than 5%. + clustering_method : `str`, default "kde" + Clustering method used to group the holidays. + Since we are doing 1-D clustering, currently supported methods include + (1) "kde" for kernel density estimation, and (2) "kmeans" for k-means clustering. + bandwidth : `float` or None, default None + The bandwidth used in the kernel density estimation. + A higher bandwidth results in fewer clusters. + If None, it is automatically inferred with the ``bandwidth_multiplier`` factor. + Only used when ``clustering_method == "kde"``. + bandwidth_multiplier : `float` or None, default 0.2 + Multiplier applied to the kernel density estimation's default bandwidth calculated from + `here_`. + This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. + Only used when ``bandwidth`` is not specified and ``clustering_method == "kde"``. + n_clusters : `int` or None, default 5 + Number of clusters in the k-means algorithm. + Only used when ``clustering_method == "kmeans"``. + include_diagnostics : `bool`, default False + Whether to include ``kmeans_diagnostics`` and ``kmeans_plot`` in the output ``result_dict``. + + Returns + ------- + Saves the results in the ``result_dict`` attribute. + """ + # These attributes are initialized to None in `__init__`. + # The values provided here override them. + self.baseline_offsets = baseline_offsets + self.use_relative_score = use_relative_score + self.clustering_method = clustering_method + + # Runs baselines to get scores on holiday events. + self.result_dict = self.get_holiday_scores( + baseline_offsets=baseline_offsets, + use_relative_score=use_relative_score, + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score + ) + # Extracts results and prepares data for kernel density estimation.
+ score_result_avg = self.result_dict["score_result_avg"] + scores_df_original = ( + pd.DataFrame(score_result_avg, index=["avg_score"]) + .transpose() + .reset_index(drop=False) + .rename(columns={"index": "event_name"}) + ) + # In rare cases some holiday events may fall on exactly the same day, hence the same score. + # We drop duplicated average scores before clustering, for example, + # Halloween (10/31) plus 1 always has the same scores as All Saints Day (11/1). + scores_df = scores_df_original.drop_duplicates("avg_score").sort_values("avg_score").reset_index(drop=True) + scores_x = np.array(scores_df["avg_score"]).reshape(-1, 1) + + # The following parameters are set to None unless the clustering method is called to populate them. + kde_res = None + kde_plot = None + kde_cutoffs = None + kmeans_diagnostics = None + kmeans_plot = None + + if clustering_method.lower() == "kde": + if bandwidth is None and bandwidth_multiplier is None: + raise ValueError(f"At least one of `bandwidth` or `bandwidth_multiplier` must be provided!") + if bandwidth is None: + # Automatically infers the best `bandwidth`. + std = scores_x.std() + iqr = np.percentile(scores_x, 75) - np.percentile(scores_x, 25) + sigma = min(std, iqr / 1.34) + bandwidth = 0.9 * sigma * (len(scores_x) ** (-1 / 5)) * bandwidth_multiplier + + kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(scores_x) + y = np.exp(kde.score_samples(scores_x)) + kde_res = pd.DataFrame({"score": scores_df["avg_score"], "density": y.tolist()}) + kde_res = kde_res.sort_values("score").reset_index(drop=True) + kde_plot = plot_multivariate( + df=kde_res, + x_col="score", + title="Kernel density of the holiday scores", + xlabel="Holiday impact", + ylabel="Kernel density" + ) + + # Performs holiday clustering based on the kernel densities. + scores = kde_res["score"].to_list() + densities = kde_res["density"].to_list() + # Find the cutoffs such that scores <= each cutoff are grouped together. 
+ kde_cutoffs = [] + for i, x in enumerate(densities): + if 0 < i < len(densities) - 1 and x < densities[i - 1] and x < densities[i + 1]: + kde_cutoffs.append(scores[i]) + + # The group around 0 may contain mixed signs, hence we manually add 0 as a cutoff. + # This might introduce an extra group with no scores - we will remove it later. + # Note that no score has an absolute value smaller than `min_abs_avg_score`. + kde_cutoffs = sorted(kde_cutoffs + [0]) + + # Constructs `daily_event_df_dict` for all groups. + daily_event_df_dict_raw = { + f"holiday_group_{i}": pd.DataFrame({ + EVENT_DF_DATE_COL: [], + EVENT_DF_LABEL_COL: [], + "original_name": [], + "avg_score": [] + }) for i in range(len(kde_cutoffs) + 1)} + + for key, value in score_result_avg.items(): + # Gets group assignment. + for i, cutoff in enumerate(sorted(kde_cutoffs)): + if value <= cutoff: + break + else: # If `value` > the largest cutoff, assigns it to the last group. + i += 1 + # Now, `i` is the group this holiday belongs to. + + # Gets the dates for each event. + # Since holiday inferrer automatically adds "+0" to all events, here we remove it. + key_without_plus_minus = "_".join(key.split("_")[:-1]) + idx = self.expanded_holiday_df[HOLIDAY_NAME_COL] == key_without_plus_minus + dates = self.expanded_holiday_df.loc[idx, HOLIDAY_DATE_COL] + + # Creates `event_df`. + group_name = f"holiday_group_{i}" + event_df = pd.DataFrame({ + EVENT_DF_DATE_COL: dates, + EVENT_DF_LABEL_COL: group_name, + "original_name": key_without_plus_minus, + "avg_score": value + }) + # `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used instead. + daily_event_df_dict_raw[group_name] = pd.concat( + [daily_event_df_dict_raw[group_name], event_df], ignore_index=True) + # Removes potential empty groups.
+ daily_event_df_dict = {} + new_idx = 0 + for i in range(len(daily_event_df_dict_raw)): + event_df = daily_event_df_dict_raw[f"holiday_group_{i}"] + if len(event_df) > 0: + daily_event_df_dict[f"holiday_group_{new_idx}"] = ( + event_df + .sort_values(by=["avg_score", EVENT_DF_DATE_COL]) + .reset_index(drop=True) + ) + new_idx += 1 + del daily_event_df_dict_raw + # Overrides the attributes in the end. + self.bandwidth = bandwidth + self.bandwidth_multiplier = bandwidth_multiplier + self.kde = kde + + elif clustering_method.lower() == "kmeans": + # Runs K-means ++ to generate group assignments. + kmeans = KMeans(n_clusters=n_clusters, random_state=0, init="k-means++").fit(scores_x) + # Predicts on the original `scores_df` since two different events may fall on the + # same dates where the score was calculated, but they may have different dates in the future. + predicted_labels = kmeans.predict(np.array(scores_df_original["avg_score"]).reshape(-1, 1)) + scores_df_with_label = scores_df_original.copy() + scores_df_with_label["labels"] = list(predicted_labels) + # Since `predicted_labels` is not ordered, we sort the groups according to the average score. + group_rank_df = scores_df_with_label.groupby(by="labels")["avg_score"].agg(np.nanmean).reset_index() + group_rank_df["group_id"] = ( + group_rank_df["avg_score"] + .rank(method="dense", ascending=True) + .astype(int) + ) - 1 # Group indices start from 0. + # Merges back to the original scores dataframe. + scores_df_with_label = scores_df_with_label.merge( + group_rank_df[["labels", "group_id"]], + on="labels", how="left") + + daily_event_df_dict = { + f"holiday_group_{i}": pd.DataFrame({ + EVENT_DF_DATE_COL: [], + EVENT_DF_LABEL_COL: [], + "original_name": [], + "avg_score": [] + }) for i in range(n_clusters)} + for key, value in score_result_avg.items(): + # Gets group assignment. 
+ group_id = scores_df_with_label.loc[scores_df_with_label["event_name"] == key, "group_id"].values[0] + group_name = f"holiday_group_{group_id}" + + # Gets the dates for each event. + # Since holiday inferrer automatically adds "+0" to all events, here we remove it. + key_without_plus_minus = "_".join(key.split("_")[:-1]) + idx = self.expanded_holiday_df[HOLIDAY_NAME_COL] == key_without_plus_minus + dates = self.expanded_holiday_df.loc[idx, HOLIDAY_DATE_COL] + + # Creates `event_df`. + event_df = pd.DataFrame({ + EVENT_DF_DATE_COL: dates, + EVENT_DF_LABEL_COL: group_name, + "original_name": key_without_plus_minus, + "avg_score": value + }) + daily_event_df_dict[group_name] = pd.concat([daily_event_df_dict[group_name], event_df]) + # Sorts the output dataframes. + for i in range(len(daily_event_df_dict)): + event_df = daily_event_df_dict[f"holiday_group_{i}"] + daily_event_df_dict[f"holiday_group_{i}"] = ( + event_df + .sort_values(by=["avg_score", EVENT_DF_DATE_COL]) + .reset_index(drop=True) + ) + + # Generates diagnostics for K means to help choose the optimal K (`n_clusters`). + if include_diagnostics: + kmeans_diagnostics = {"k": [], "wsse": [], "sil_score": []} + for candidate_k in range(2, min(len(scores_df) // 2, 20) + 1): + tmp_model = KMeans(n_clusters=candidate_k).fit(scores_x) + # Gets Silhouette score. + kmeans_diagnostics["k"].append(candidate_k) + kmeans_diagnostics["sil_score"].append(silhouette_score( + X=scores_x, + labels=tmp_model.labels_, + metric="euclidean" + )) + # Gets within-cluster sum of squared errors. 
+ centroids = tmp_model.cluster_centers_ + pred_clusters = tmp_model.predict(scores_x) + curr_sse = 0 + for i in range(len(scores_x)): + curr_center = centroids[pred_clusters[i]] + curr_sse += (scores_x[i, 0] - curr_center[0]) ** 2 + kmeans_diagnostics["wsse"].append(curr_sse) + kmeans_diagnostics = pd.DataFrame(kmeans_diagnostics) + kmeans_plot = plot_multivariate( + df=kmeans_diagnostics, + x_col="k", + xlabel="n_clusters", + title="K-means diagnostics<br>" + "(1) Within-cluster SSE: lower is better<br>" + "(2) Silhouette scores: higher is better" + ) + + # Overrides the attributes in the end. + self.n_clusters = n_clusters + self.kmeans = kmeans + self.include_diagnostics = include_diagnostics + else: + raise NotImplementedError(f"`clustering_method` {clustering_method} is not supported! " + f"Must be one of \"kde\" (kernel density estimation) or " + f"\"kmeans\" (k-means).") + + # Cleans up and removes duplicate dates. + new_daily_event_df_dict = { + key: (df[[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]] + .drop_duplicates(EVENT_DF_DATE_COL) + .reset_index(drop=True)) for key, df in daily_event_df_dict.items() + } + self.result_dict.update({ + "daily_event_df_dict_with_score": daily_event_df_dict, + "daily_event_df_dict": new_daily_event_df_dict, + "kde_cutoffs": kde_cutoffs, + "kde_res": kde_res, + "kde_plot": kde_plot, + "kmeans_diagnostics": kmeans_diagnostics, + "kmeans_plot": kmeans_plot + }) + + def get_holiday_scores( + self, + baseline_offsets: Tuple[int, int] = (-7, 7), + use_relative_score: bool = True, + min_n_days: int = 1, + min_same_sign_ratio: float = 0.66, + min_abs_avg_score: float = 0.05) -> Dict[str, Any]: + """Computes the score of all holiday events and their neighboring days + in ``self.expanded_holiday_df``, by comparing their observed values with a baseline + value that is an average of the values on the days specified in ``baseline_offsets``. + If a baseline date falls on another holiday, the algorithm looks for the next + value with the same step size as the given offset, up to 3 extra iterations. + Please see more details in + `~greykite.algo.common.holiday_inferrer.HolidayInferrer._get_scores_for_holidays`. + An additional pruning step is done to remove holidays with inconsistent / negligible scores. + Both the results before and after the pruning are returned. + + Parameters + ---------- + baseline_offsets : `Tuple`[`int`], default (-7, 7) + The offsets in days to calculate baselines for a given holiday.
+ By default, the same days of the week before and after are used. + use_relative_score : `bool`, default True + Whether to use relative or absolute score when estimating the holiday impact. + min_n_days : `int`, default 1 + Minimal number of occurrences for a holiday event to be kept before grouping. + min_same_sign_ratio : `float`, default 0.66 + Threshold of the ratio of the same-sign scores for an event's occurrences. + For example, if an event has two occurrences, they both need to have positive or negative + scores for the ratio to achieve 0.66. + Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. + This parameter is intended to rule out holidays that have indefinite effects. + min_abs_avg_score : `float`, default 0.05 + The minimal average score of an event (across all its occurrences) to be kept + before grouping. + When ``use_relative_score = True``, 0.05 means the effect must be greater than 5%. + + Returns + ------- + result_dict : `Dict` [`str`, Any] + A dictionary containing the scoring results. + In particular the following keys are set: "holiday_inferrer", "score_result_original", + "score_result_avg_original", "score_result", and "score_result_avg". + Please refer to the docstring of the ``self.result_dict`` attribute of `HolidayGrouper`. + """ + # Initializes `HolidayInferrer` and sets the parameters. + hi = HolidayInferrer() + hi.df = self.df.copy() + # In `HolidayInferrer._get_scores_for_holidays`, `time_col` must be of `datetime.date` or `str` type. 
+ hi.df[self.time_col] = pd.to_datetime(hi.df[self.time_col]).dt.date + hi.ts = set(hi.df[self.time_col]) + hi.time_col = self.time_col + hi.value_col = self.value_col + hi.baseline_offsets = baseline_offsets + hi.use_relative_score = use_relative_score + hi.pre_search_days = 0 + hi.post_search_days = 0 + hi.country_holiday_df = self.expanded_holiday_df.copy() + hi.all_holiday_dates = self.expanded_holiday_df[HOLIDAY_DATE_COL].tolist() + hi.holidays = self.expanded_holiday_df[HOLIDAY_NAME_COL].unique().tolist() + + # Gets the scores for each single date in `self.expanded_holiday_df`. + hi.score_result = hi._get_scores_for_holidays() + # Gets the average scores over multiple occurrences for each holiday in `self.expanded_holiday_df`. + hi.score_result_avg = hi._get_averaged_scores() + + # Prunes holiday that has too few datapoints or inconsistent / negligible scores. + pruned_result = self._prune_holiday_by_score( + score_result=hi.score_result, + score_result_avg=hi.score_result_avg, + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score + ) + + # Returns result both before and after pruning. + self.result_dict = { + "holiday_inferrer": hi, + "score_result_original": hi.score_result, + "score_result_avg_original": hi.score_result_avg, + "score_result": pruned_result["score_result"], + "score_result_avg": pruned_result["score_result_avg"] + } + return self.result_dict + + def check_scores( + self, + holiday_name_pattern: str, + show_pruned: bool = True) -> None: + """Spot checks the score of certain holidays containing pattern ``holiday_name_pattern``. + Prints out the dates, individual day scores of all occurrences, + and the average scores of all matching holiday events. + Note that it only checks the keys in ``self.expanded_holiday_df``, + and it assumes `get_holiday_scores` is already run. 
+ + Parameters + ---------- + holiday_name_pattern : `str` + Any substring of the holiday event names (``self.expanded_holiday_df[self.holiday_name_col]``). + show_pruned : `bool`, default True + Whether to show pruned holidays along with the remaining holidays. + + Returns + ------- + Prints out the dates, individual day scores of all occurrences, + and the average scores of all matching holiday events. + """ + result_dict = self.result_dict + if result_dict is None: + return + + # The "*_original" results are those before pruning, so they include the pruned holidays. + if show_pruned: + score_result = result_dict["score_result_original"] + score_result_avg = result_dict["score_result_avg_original"] + else: + score_result = result_dict["score_result"] + score_result_avg = result_dict["score_result_avg"] + res_dict = {} + for key, value in score_result_avg.items(): + if holiday_name_pattern in key: + # `HolidayInferrer` automatically adds "+" and "-" to the end, we remove them. + key_without_plus_minus = "_".join(key.split("_")[:-1]) + dates = self.expanded_holiday_df.loc[ + self.expanded_holiday_df[HOLIDAY_NAME_COL] == key_without_plus_minus, # Uses exact matching. + HOLIDAY_DATE_COL + ] + dates = dates.dt.strftime("%Y-%m-%d").to_list() + dates = [date for date in dates if date in self.df[self.time_col].dt.strftime("%Y-%m-%d").tolist()] + # Prints out the date and impact of each day. + impacts = score_result[key] + print(f"{key_without_plus_minus}:\n" + f"Dates: {dates}\n" + f"Scores: {impacts}\n") + # Extracts average score. + res_dict[key_without_plus_minus] = value + print("Average impact:") + display(res_dict) + + def check_holiday_group( + self, + holiday_name_pattern: str = "", + holiday_groups: Optional[Union[List[int], int]] = None) -> None: + """Prints out the holiday groups that contain holidays matching ``holiday_name_pattern`` and their scores. + The search is limited to the given ``holiday_groups``. + Note that it assumes `group_holidays` has already been run.
+ + Parameters + ---------- + holiday_name_pattern : `str` + Any substring of the holiday event names (``self.expanded_holiday_df[self.holiday_name_col]``). + holiday_groups : `List`[`int`] or `int`, default None + The indices of holiday groups that the searching is limited in. + If None, all groups are available to search. + + Returns + ------- + Prints out all qualifying holiday groups and their scores. + """ + result_dict = self.result_dict + if result_dict is None or "daily_event_df_dict_with_score" not in result_dict.keys(): + raise Exception(f"Method `group_holidays` must be run before using the `check_holiday_group` method.") + + daily_event_df_dict_with_score = result_dict["daily_event_df_dict_with_score"] + if holiday_groups is None: + holiday_groups = list(range(len(daily_event_df_dict_with_score))) + if isinstance(holiday_groups, int): + holiday_groups = [holiday_groups] + + is_found = False + for group_id in holiday_groups: + group_name = f"holiday_group_{group_id}" + event_df = daily_event_df_dict_with_score.get(group_name) + if event_df is not None and event_df["original_name"].str.contains(holiday_name_pattern).sum() > 0: + is_found = True + print(f"`{group_name}` contains events matching the provided pattern.\n" + f"This group includes {event_df['original_name'].nunique()} distinct events.\n") + with pd.option_context("display.max_rows", None): + display(event_df) + + if not is_found: + print(f"No matching records found given pattern {holiday_name_pattern.__repr__()} " + f"and holiday groups {holiday_groups}.") + + def _prune_holiday_by_score( + self, + score_result: Dict[str, List[float]], + score_result_avg: Dict[str, float], + min_n_days: int = 1, + min_same_sign_ratio: float = 0.66, + min_abs_avg_score: float = 0.05) -> Dict[str, Any]: + """Removes events that have too few datapoints or inconsistent / negligible scores + given ``score_result`` and ``score_result_avg``. 
+ + Parameters + ---------- + score_result : `Dict`[`str`, `List`[`float`]] + A dictionary with keys being the names of all holiday events, + values being a list of scores of all dates corresponding to this event. + score_result_avg : `Dict`[`str`, `float`] + A dictionary with the same key as in ``result_dict["score_result_original"]``. + But the values are the average scores of each event across all occurrences. + min_n_days : `int`, default 1 + Minimal number of occurrences for a holiday event to be kept before grouping. + min_same_sign_ratio : `float`, default 0.66 + Threshold of the ratio of the same-sign scores for an event's occurrences. + For example, if an event has two occurrences, they both need to have positive or negative + scores for the ratio to achieve 0.66. + Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. + This parameter is intended to rule out holidays that have indefinite effects. + min_abs_avg_score : `float`, default 0.05 + The minimal average score of an event (across all its occurrences) to be kept + before grouping. + When ``use_relative_score = True``, 0.05 means the effect must be greater than 5%. + + Returns + ------- + result : `Dict`[`str`, Any] + A dictionary with two keys: "score_result", "score_result_avg", values being + the same dictionary as the input ``score_result``, ``score_result_avg``, + but only with the remaining events after pruning. + """ + res_score = {} + res_score_avg = {} + for key, value in score_result.items(): + # `key` is the name of the event. + # `value` is a list of scores, we need to check the following. + # (1) It has minimum length `min_n_days`. + if len(value) < min_n_days: + continue + + # (2) The ratio of same-sign scores is at least `min_same_sign_ratio`. 
+ signs = [(score > 0) * 1 for score in value] + n_pos, n_neg = sum(signs), len(signs) - sum(signs) + if max(n_pos, n_neg) < min_same_sign_ratio * (n_pos + n_neg): + continue + + # (3) The average score needs to meet `min_abs_avg_score` to be included. + if abs(score_result_avg[key]) < min_abs_avg_score: + continue + + # (4) The average score is not NaN. + if np.isnan(score_result_avg[key]): + continue + + res_score[key] = value + res_score_avg[key] = score_result_avg[key] + log_message( + message=f"Holidays before pruning: {len(score_result)}; after pruning: {len(res_score)}.", + level=LoggingLevelEnum.INFO + ) + return { + "score_result": res_score, + "score_result_avg": res_score_avg + } + + @staticmethod + def expand_holiday_df_with_suffix( + holiday_df: pd.DataFrame, + holiday_date_col: str, + holiday_name_col: str, + holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, + get_suffix_func: Optional[Union[Callable, str]] = "wd_we") -> pd.DataFrame: + """Expands an input holiday dataframe ``holiday_df`` to include the neighboring days + specified in ``holiday_impact_dict``. + Also adds suffixes generated by ``get_suffix_func`` to better model the effects + of events falling on different days of week. + + Parameters + ---------- + holiday_df : `pandas.DataFrame` + Input holiday dataframe that contains the dates and names of the holidays. + holiday_date_col : `str` + Name of the holiday date column in ``holiday_df``. + holiday_name_col : `str` + Name of the holiday name column in ``holiday_df``. + holiday_impact_dict : `Dict` [`str`, Any] or None, default None + A dictionary containing the neighboring impacting days of a certain holiday. + The key is the name of the holiday matching those in the provided ``holiday_df``. + The value is a tuple of two values indicating the number of neighboring days + before and after the holiday. For example, a valid dictionary may look like: + + .. 
code-block:: python + + holiday_impact_dict = { + "Christmas Day": [3, 3], + "Memorial Day": [0, 0] + } + + get_suffix_func : Callable or `str` or None, default "wd_we" + A function that generates a suffix (usually a time feature e.g. "_WD" for weekday, + "_WE" for weekend) given an input date. + This can be used to estimate the interaction between floating holidays + and on which day they are getting observed. + We currently support two defaults: + + - "wd_we" to generate suffixes based on whether the day falls on weekday or weekend. + - "dow_grouped" to generate three categories: ["_WD", "_Sat", "_Sun"]. + + If None, no suffix is added. + + Returns + ------- + expanded_holiday_df : `pandas.DataFrame` + An expansion of ``holiday_df`` after adding the neighboring dates provided in + ``holiday_impact_dict`` and the suffix generated by ``get_suffix_func``. + For example, if ``"Christmas Day": [3, 3]`` and "wd_we" are used, events + such as "Christmas Day_WD_plus_1_WE" or "Christmas Day_WD_minus_3_WD" will be generated + for a Christmas that falls on Friday. + """ + error_message = f"`get_suffix_func` {get_suffix_func.__repr__()} is not supported! " \ + f"Only supports None, Callable, \"dow_grouped\" or \"wd_we\"." + if get_suffix_func is None: + def get_suffix_func(x): return "" + elif get_suffix_func == "wd_we": + get_suffix_func = get_weekday_weekend_suffix + elif get_suffix_func == "dow_grouped": + get_suffix_func = get_dow_grouped_suffix + elif isinstance(get_suffix_func, Callable): + get_suffix_func = get_suffix_func + else: + raise NotImplementedError(error_message) + + if holiday_impact_dict is None: + holiday_impact_dict = {} + + expanded_holiday_df = pd.DataFrame() + for _, row in holiday_df.iterrows(): + # Handles different holidays differently. 
+ if row[holiday_name_col] in holiday_impact_dict.keys(): + pre_search_days, post_search_days = holiday_impact_dict[row[holiday_name_col]] + else: + pre_search_days, post_search_days = 0, 0 + + for i in range(-pre_search_days, post_search_days + 1): + original_dow_flag = get_suffix_func(row[holiday_date_col]) + new_ts = (row[holiday_date_col] + timedelta(days=1) * i) + new_dow_flag = get_suffix_func(new_ts) + if i < 0: + suffix = f"{original_dow_flag}_minus_{-i}{new_dow_flag}" + elif i > 0: + suffix = f"{original_dow_flag}_plus_{i}{new_dow_flag}" + else: + suffix = f"{original_dow_flag}" + new_row = { + holiday_date_col: new_ts, + holiday_name_col: f"{row[holiday_name_col]}{suffix}", + } + expanded_holiday_df = pd.concat([ + expanded_holiday_df, + pd.DataFrame.from_dict({k: [v] for k, v in new_row.items()}) + ], ignore_index=True) + + return expanded_holiday_df.sort_values(holiday_date_col).reset_index(drop=True) diff --git a/greykite/algo/common/holiday_inferrer.py b/greykite/algo/common/holiday_inferrer.py index f669911..c69f73b 100644 --- a/greykite/algo/common/holiday_inferrer.py +++ b/greykite/algo/common/holiday_inferrer.py @@ -116,6 +116,8 @@ class HolidayInferrer: This is the output from `pypi:holidays-ext`. Duplicates are dropped. Observed holidays are merged. + all_holiday_dates : `list` [`datetime`] or None + All holiday dates contained in ``country_holiday_df``. holidays : `list` [`str`] or None A list of holidays in ``country_holiday_df``. 
score_result : `dict` [`str`, `list` [`float`]] or None @@ -149,6 +151,7 @@ def __init__(self): self.independent_holiday_thres: Optional[float] = None self.together_holiday_thres: Optional[float] = None self.extra_years: Optional[int] = None + self.use_relative_score: Optional[bool] = None # Data set info self.df: Optional[pd.DataFrame] = None self.time_col: Optional[str] = None @@ -158,6 +161,7 @@ def __init__(self): self.ts: Optional[Set[datetime.date]] = None # Derived results self.country_holiday_df: Optional[pd.DataFrame] = None + self.all_holiday_dates: Optional[List[datetime.date]] = None self.holidays: Optional[List[str]] = None self.score_result: Optional[Dict[str, List[float]]] = None self.score_result_avg: Optional[Dict[str, float]] = None @@ -175,7 +179,8 @@ def infer_holidays( plot: bool = False, independent_holiday_thres: float = 0.8, together_holiday_thres: float = 0.99, - extra_years: int = 2) -> Optional[Dict[str, any]]: + extra_years: int = 2, + use_relative_score: bool = False) -> Optional[Dict[str, any]]: """Infers significant holidays and holiday configurations. The class works for daily and sub-daily data. @@ -235,6 +240,11 @@ def infer_holidays( extra_years : `int`, default 2 Extra years after ``self.year_end`` to pull holidays in ``self.country_holiday_df``. This can be used to cover the forecast periods. + use_relative_score : `bool`, default False + Whether the holiday effect is calculated as a relative ratio. + If `False`, `~greykite.algo.common.holiday_inferrer.HolidayInferrer._get_score_for_dates` + will use absolute difference compared to the baseline as the score. + If `True`, it uses relative ratio compared to the baseline as the score. Returns ------- @@ -272,6 +282,7 @@ def infer_holidays( # At least 1 year for completeness. raise ValueError("The parameter 'extra_years' must be a positive integer.") self.extra_years = extra_years + self.use_relative_score = use_relative_score # Pre-processes data. 
df = df.copy() @@ -304,6 +315,7 @@ def infer_holidays( # Gets holiday candidates. self.country_holiday_df, self.holidays = self._get_candidate_holidays(countries=countries) + self.all_holiday_dates = self.country_holiday_df["ts"].tolist() # Gets scores for holidays. self.score_result = self._get_scores_for_holidays() # Gets the average scores over multiple occurrences for each holiday. @@ -479,6 +491,7 @@ def _get_score_for_dates( """ scores = [] for date in event_dates: + log_message(message=f"Current holiday date: {date}.\n", level=LoggingLevelEnum.DEBUG) # Calculates the dates for baseline. baseline_dates = [] for offset in self.baseline_offsets: @@ -486,14 +499,21 @@ def _get_score_for_dates( counter = 1 # If a baseline date falls on another holiday, it is moving further. # But the total iterations cannot exceed 3. - while new_date in event_dates and counter < 3: + while new_date in self.all_holiday_dates and counter <= 3: + log_message( + message=f"Skipping {new_date}, new date is {new_date + timedelta(days=offset)}.\n", + level=LoggingLevelEnum.DEBUG + ) counter += 1 new_date += timedelta(days=offset) baseline_dates.append(new_date) + log_message(message=f"Baseline dates are: {baseline_dates}.\n", level=LoggingLevelEnum.DEBUG) # Calculates the average of the baseline observations. baseline = self.df[self.df[self.time_col].isin(baseline_dates)][self.value_col].mean() # Calculates the score for the current occurrence. 
score = self.df[self.df[self.time_col] == date][self.value_col].values[0] - baseline + if self.use_relative_score: + score /= baseline scores.append(score) return scores diff --git a/greykite/algo/common/holiday_utils.py b/greykite/algo/common/holiday_utils.py new file mode 100644 index 0000000..6792ea2 --- /dev/null +++ b/greykite/algo/common/holiday_utils.py @@ -0,0 +1,85 @@ +# BSD 2-CLAUSE LICENSE + +# Redistribution and use in source and binary forms, with or without modification, +# are permitted provided that the following conditions are met: + +# Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR +# #ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# original author: Yi Su +"""Constants and utility functions for `HolidayInferrer` and `HolidayGrouper`.""" + +import datetime + + +HOLIDAY_NAME_COL = "country_holiday" +"""Holiday name column used in `HolidayInferrer`. +This comes from the default output of `holidays_ext.get_holidays.get_holiday_df`. 
+""" + +HOLIDAY_DATE_COL = "ts" +"""Holiday date column used in `HolidayInferrer`. +This comes from the default output of `holidays_ext.get_holidays.get_holiday_df`. +""" + +HOLIDAY_IMPACT_DICT = { + "Halloween": (1, 1), # 10/31. + "New Year's Day": (3, 3), # 1/1. +} +"""An example. Number of pre/post days that a holiday has impact on. +For example, if Halloween has neighbor (1, 1), Halloween_minus_1 +and Halloween_plus_1 will be generated as two additional events. +If not specified, (0, 0) will be used. +""" + + +def get_dow_grouped_suffix(date: datetime.datetime) -> str: + """Utility function to generate a suffix given an input ``date``. + + Parameters + ---------- + date : `datetime.datetime` + Input timestamp. + + Returns + ------- + suffix : `str` + The suffix string starting with "_". + """ + if date.day_name() == "Saturday": + return "_Sat" + elif date.day_name() == "Sunday": + return "_Sun" + else: + return "_WD" + + +def get_weekday_weekend_suffix(date: datetime.datetime) -> str: + """Utility function to generate a suffix given an input ``date``. + + Parameters + ---------- + date : `datetime.datetime` + Input timestamp. + + Returns + ------- + suffix : `str` + The suffix string starting with "_". + """ + if date.day_name() in ("Saturday", "Sunday"): + return "_WE" + else: + return "_WD" diff --git a/greykite/algo/common/l1_quantile_regression.py b/greykite/algo/common/l1_quantile_regression.py index 47f0935..626fa96 100644 --- a/greykite/algo/common/l1_quantile_regression.py +++ b/greykite/algo/common/l1_quantile_regression.py @@ -32,6 +32,7 @@ import cvxpy as cp import numpy as np import pandas as pd +import scipy from sklearn.base import BaseEstimator from sklearn.base import RegressorMixin @@ -95,7 +96,12 @@ def ordinary_quantile_regression( # Solves with formula. # Multiplies weights to x first to speed up. x_l = x / s - beta_new = np.linalg.pinv(x_l.T @ x) @ (x_l.T @ y) + # Tries to solve with `scipy.linalg.solve`, which gives a more stable solution. 
+ # If it fails, falls back to `numpy.linalg.pinv`. + try: + beta_new = scipy.linalg.solve(x_l.T @ x, x_l.T @ y, assume_a="pos") + except np.linalg.LinAlgError: + beta_new = np.linalg.pinv(x_l.T @ x) @ (x_l.T @ y) err = (np.abs(beta_new - beta) / np.abs(beta)).max() beta = beta_new if err < tol: @@ -215,7 +221,8 @@ def __init__( feature_weight: Optional[np.typing.ArrayLike] = None, max_iter: int = 100, tol: float = 1e-2, - fit_intercept: bool = True): + fit_intercept: bool = True, + optimize_mape: bool = False): self.quantile = quantile self.alpha = alpha self.sample_weight = sample_weight @@ -223,6 +230,7 @@ def __init__( self.max_iter = max_iter self.tol = tol self.fit_intercept = fit_intercept + self.optimize_mape = optimize_mape # Parameters, set by ``fit`` method. self.n = None @@ -301,6 +309,19 @@ def _process_input( self.nonconstant_cols = [i for i in range(self.p) if i not in self.constant_cols] if self.alpha == 0: x = np.concatenate([np.ones([self.n, 1]), x[:, self.nonconstant_cols]], axis=1) + + # If ``optimize_mape`` is ``True``, this overrides ``quantile`` and ``sample_weight``. + if self.optimize_mape: + log_message( + message="The parameter 'optimize_mape' is set to 'True', " + "ignoring the input 'quantile' and 'sample_weight', " + "setting 'quantile' to 0.5 and 'sample_weight' to the inverse " + "absolute values of the response.", + level=LoggingLevelEnum.WARNING + ) + self.sample_weight = 1 / np.abs(y) + self.quantile = 0.5 + return x, y def fit( diff --git a/greykite/algo/common/ml_models.py b/greykite/algo/common/ml_models.py index 58a4afc..0c5b3a8 100644 --- a/greykite/algo/common/ml_models.py +++ b/greykite/algo/common/ml_models.py @@ -18,19 +18,24 @@ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-# original author: Reza Hosseini +# original author: Reza Hosseini, Yi Su """Functions to fit a machine learning model and use it for prediction. """ import random import re +import traceback import warnings +from typing import Dict +from typing import List +from typing import Optional import matplotlib import numpy as np import pandas as pd import patsy +import scipy import statsmodels.api as sm from pandas.plotting import register_matplotlib_converters from sklearn.ensemble import GradientBoostingRegressor @@ -50,56 +55,134 @@ from greykite.common.evaluation import calc_pred_err from greykite.common.evaluation import r2_null_model_score from greykite.common.features.normalize import normalize_df +from greykite.common.logging import LoggingLevelEnum +from greykite.common.logging import log_message from greykite.common.python_utils import group_strs_with_regex_patterns from greykite.common.viz.timeseries_plotting import plot_multivariate matplotlib.use("agg") # noqa: E402 -import matplotlib.pyplot as plt # isort:skip +import matplotlib.pyplot as plt # isort:skip # noqa: E402 register_matplotlib_converters() +def get_intercept_col_from_design_mat( + x_mat: pd.DataFrame) -> Optional[str]: + """Gets the explicit or implicit intercept column name from a `patsy` design matrix. + + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". + However, if there are categorical variables, even if we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + + The logic in `patsy` is that when the intercept is excluded, + the first categorical variable in the formula string always has all levels. + The levels are ordered in alphabetical order. + In this case, we will search for the first categorical variable + and remove its first level.
+ + Parameters + ---------- + x_mat : `pandas.DataFrame` + The design matrix built by `patsy`. + Must have attribute ``design_info``. + + Returns + ------- + name : `str` or None + The column name of the explicit or implicit intercept in ``x_mat``. + """ + design_info = getattr(x_mat, "design_info", None) + name = None + if design_info is not None: + terms = design_info.terms + # Checks if intercept is in the design matrix. + if patsy.desc.Term([]) in terms: + name = patsy.desc.Term([]).name() # Name is "Intercept". + else: + # Intercept is not in design matrix, + # finds the implicit intercept. + # `patsy` orders categorical variables first, + # and the first categorical variable has all levels. + # We remove the first level. + for _, idx_slice in design_info.term_name_slices.items(): + # We only need to iterate the first element. + if idx_slice.stop - idx_slice.start > 1: + # Sets name to the first column only when the term has + # more than 1 column (ignores the non-categorical case). + name = list(x_mat.columns)[idx_slice.start] + break + return name + + def design_mat_from_formula( - df, - model_formula_str, - y_col=None, - pred_cols=None): + df: pd.DataFrame, + model_formula_str: str, + y_col: Optional[str] = None, + pred_cols: Optional[List[str]] = None, + remove_intercept: bool = False) -> Dict: """ Given a formula it extracts the response vector (y) - and builds the design matrix (x_mat) + and builds the design matrix (x_mat). - :param df: pd.DataFrame - A dataframe with the response vector (y) and the feature columns (x_mat) - :param model_formula_str: str + Parameters + ---------- + df : `pandas.DataFrame` + A dataframe with the response vector (y) and the feature columns (x_mat). + model_formula_str : `str` A formula string e.g. "y~x1+x2+x3*x4". This is similar to R formulas. See https://patsy.readthedocs.io/en/latest/formulas.html#how-formulas-work.
- :param y_col: str + y_col : `str` or None, default None The column name which has the value of interest to be forecasted. If the model_formula_str is not passed, y_col e.g. ["y"] is used - as the response vector column - :param pred_cols: List[str] - The names of the feature columns + as the response vector column. + pred_cols : `list` [`str`] or None, default None + The names of the feature columns. If the model_formula_str is not passed, pred_cols e.g. - ["x1", "x2", "x3"] is used as the design matrix columns - :return: dict - "y": The response vector - "y_col": Name of the response column (y) - "x_mat": A design matrix - "pred_cols": Name of the columns of the design matrix (x_mat) - "x_design_info": Information for design matrix + ["x1", "x2", "x3"] is used as the design matrix columns. + remove_intercept : `bool`, default False + Whether to remove explicit and implicit intercepts. + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". + However, if there are categorical variables, even we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + Sometimes we don't want this to happen. + Setting this parameter to True will remove both explicit and implicit intercepts. + + Returns + ------- + result : `dict` + Result dictionary with the following keys: + + - "y": The response vector. + - "y_col": Name of the response column (y). + - "x_mat": A design matrix. + - "pred_cols": Name of the columns of the design matrix (x_mat). + - "x_design_info": Information for design matrix. + - "drop_intercept_col": The intercept column to be dropped. 
+ """ + intercept_col = None if model_formula_str is not None: y, x_mat = patsy.dmatrices( model_formula_str, data=df, return_type="dataframe") + x_design_info = x_mat.design_info + if remove_intercept: + intercept_col = get_intercept_col_from_design_mat( + x_mat=x_mat + ) + if intercept_col is not None: + x_mat = x_mat.drop(columns=intercept_col) pred_cols = list(x_mat.columns) - # get the response column name using "~" location + # Gets the response column name using "~" location. y_col = re.search("(.*)~", model_formula_str).group(1).strip(" ") y = y[y.columns[0]] - x_design_info = x_mat.design_info elif y_col is not None and pred_cols is not None: y = df[y_col] x_mat = df[pred_cols] @@ -113,7 +196,9 @@ def design_mat_from_formula( "y_col": y_col, "x_mat": x_mat, "pred_cols": pred_cols, - "x_design_info": x_design_info} + "x_design_info": x_design_info, + "drop_intercept_col": intercept_col + } def fit_model_via_design_matrix( @@ -261,6 +346,38 @@ def fit_model_via_design_matrix( return ml_model +def get_h_mat(x_mat, alpha): + """Computes the H matrix given ``x_mat`` and ``alpha`` for linear and ridge regression. + The formula is ``H = inv(X.T @ X + alpha * np.eye(p)) @ X.T``. + + Parameters + ---------- + x_mat : `numpy.ndarray` or `pandas.DataFrame` + Design matrix, dimension n by p. + alpha : `float` + The regularization term from the linear / ridge regression. + Note that the OLS (ridge) estimator is ``inv(X.T @ X + alpha * np.eye(p)) @ X.T @ Y =: H @ Y``. + + Returns + ------- + h_mat : `numpy.ndarray` + The H matrix as defined above. Dimension is p by n. + """ + X = np.array(x_mat) + p = X.shape[1] + XTX_alpha = X.T @ X + np.diag([alpha] * p) + log_cond = np.log10(np.linalg.cond(XTX_alpha)) + digits_to_lose = 8 + # When `log_cond` is small, the matrix is full rank and not near singular, + # in this case we should use `solve` for a positive definite matrix to optimize efficiency. 
+ # When `log_cond` is large, the matrix is near singular, so we use `pinvh` instead. + if log_cond < digits_to_lose: + h_mat = scipy.linalg.solve(XTX_alpha, X.T, assume_a="pos") + else: + h_mat = scipy.linalg.pinvh(XTX_alpha) @ X.T + return h_mat + + def fit_ml_model( df, model_formula_str=None, @@ -272,7 +389,8 @@ def fit_ml_model( max_admissible_value=None, uncertainty_dict=None, normalize_method="zero_to_one", - regression_weight_col=None): + regression_weight_col=None, + remove_intercept=False): """Fits predictive ML (machine learning) models to continuous response vector (given in ``y_col``) and returns fitted model. @@ -328,6 +446,15 @@ def fit_ml_model( regression_weight_col : `str` or None, default None The column name for the weights to be used in weighted regression version of applicable machine-learning models. + remove_intercept : `bool`, default False + Whether to remove explicit and implicit intercepts. + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". + However, if there are categorical variables, even if we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + Sometimes we don't want this to happen. + Setting this parameter to True will remove both explicit and implicit intercepts. Returns ------- @@ -350,12 +477,14 @@ def fit_ml_model( """ - # build model matrices + # Builds model matrices.
res = design_mat_from_formula( df=df, model_formula_str=model_formula_str, y_col=y_col, - pred_cols=pred_cols) + pred_cols=pred_cols, + remove_intercept=remove_intercept + ) y = res["y"] y_mean = np.mean(y) @@ -363,6 +492,7 @@ def fit_ml_model( x_mat = res["x_mat"] y_col = res["y_col"] x_design_info = res["x_design_info"] + drop_intercept_col = res["drop_intercept_col"] normalize_df_func = None if normalize_method is not None: @@ -387,7 +517,7 @@ def fit_ml_model( f"The column {regression_weight_col} includes negative values.") sample_weight = df[regression_weight_col] - # prediction model generated by using all observed data + # Prediction model generated by using all observed data. ml_model = fit_model_via_design_matrix( x_train=x_mat, y_train=y, @@ -395,17 +525,86 @@ def fit_ml_model( fit_algorithm_params=fit_algorithm_params, sample_weight=sample_weight) - # uncertainty model is fitted if uncertainty_dict is passed + # Obtains `alpha`, `p_effective`, `h_mat` (H), and `sigma_scaler`. + # See comments below the variables. + # Read more at https://online.stat.psu.edu/stat508/lesson/5/5.1 or + # book: “Applied Regression Analysis” by Norman R. Draper, Harry Smith. + alpha = None + """The regularization term from the linear / ridge regression. + Note that the OLS (ridge) estimator is ``inv(X.T @ X + alpha * np.eye(p)) @ X.T @ Y =: H @ Y``. + """ + p_effective = None + """Effective number of parameters. + In linear regressions, it is also equal to ``trace(X @ H)``, where H is defined above. + ``X @ H`` is also called the hat matrix. + """ + h_mat = None + """The H matrix (p by n) in linear regression estimator, as defined above. + Note that H is not necessarily of full-rank p even in ridge regression. + ``H = inv(X.T @ X + alpha * np.eye(p)) @ X.T``. + """ + sigma_scaler = None + """Theoretical scaler of the estimated sigma. 
+ Volatility model estimates sigma by taking the sample standard deviation, and + we need to scale it by ``np.sqrt((n_train - 1) / (n_train - p_effective))`` to obtain + an unbiased estimator. + """ + x_mean = None + """Column mean of ``x_mat`` as a row vector. + This is stored and used in ridge regression to compute the prediction intervals. + In other methods, it is set to `None`. + """ + if fit_algorithm in ["ridge", "linear"]: + X = np.array(x_mat) + n_train, p = X.shape + # Extracts `alpha` from the fitted ML model. + # In linear regression, the rank of the design matrix is `p_effective`, + # but for `RidgeCV` we need to derive it manually by taking the trace. + # Note that `RidgeCV` centers `X` and `Y` before fitting, hence we need to center `X` too. + if fit_algorithm == "ridge": + alpha = ml_model.alpha_ + x_mean = X.mean(axis=0).reshape(1, -1) + X = X - x_mean + else: + alpha = 0 + p_effective = np.linalg.matrix_rank(X) + # Computes `h_mat` (H, p x n). + try: + h_mat = get_h_mat(x_mat=X, alpha=alpha) + if fit_algorithm == "ridge": + # Computes the effective number of parameters. + # Note that `p_effective` is the trace of `X @ h_mat` plus 1 for intercept, however + # computing `trace(h_mat @ X)` is more efficient due to much faster matrix multiplication. + p_effective = round(np.trace(h_mat @ X), 6) + 1 # Avoids floating issues e.g. 1.9999999999999998. + except np.linalg.LinAlgError as e: + message = traceback.format_exc() + warning_msg = f"Error '{e}' occurred when computing `h_mat`, no variance scaling is done!\n" \ + f"{message}" + log_message(warning_msg, LoggingLevelEnum.WARNING) + warnings.warn(warning_msg) + + if p_effective is not None and round(p_effective) < n_train: + # Computes scaler on sigma estimate. + sigma_scaler = np.sqrt((n_train - 1) / (n_train - p_effective)) + else: + warnings.warn(f"Zero degrees of freedom ({n_train}-{p_effective}) or the inverse solver failed. " + f"Likely caused by singular `X.T @ X + alpha * np.eye(p)`.
" + f"Please check \"x_mat\", \"alpha\". " + f"`sigma_scaler` cannot be computed!") + + # Uncertainty model is fitted if `uncertainty_dict` is passed. uncertainty_model = None if uncertainty_dict is not None: uncertainty_method = uncertainty_dict["uncertainty_method"] if uncertainty_method == "simple_conditional_residuals": - # reset index to match behavior of predict before assignment + # Resets index to match behavior of predict before assignment. new_df = df.reset_index(drop=True) (new_x_mat,) = patsy.build_design_matrices( [x_design_info], data=new_df, return_type="dataframe") + if drop_intercept_col is not None: + new_x_mat = new_x_mat.drop(columns=drop_intercept_col) if normalize_df_func is not None: if "Intercept" in list(x_mat.columns): cols = [col for col in list(x_mat.columns) if col != "Intercept"] @@ -416,8 +615,8 @@ def fit_ml_model( new_df[f"{y_col}_pred"] = ml_model.predict(new_x_mat) new_df[RESIDUAL_COL] = new_df[y_col] - new_df[f"{y_col}_pred"] - # re-assign some param defaults for function conf_interval - # with values best suited to this case + # Re-assigns some param defaults for function `conf_interval` + # with values best suited to this case. conf_interval_params = { "quantiles": [0.025, 0.975], "sample_size_thresh": 10} @@ -428,6 +627,9 @@ def fit_ml_model( df=new_df, distribution_col=RESIDUAL_COL, offset_col=y_col, + sigma_scaler=sigma_scaler, + h_mat=h_mat, + x_mean=x_mean, min_admissible_value=min_admissible_value, max_admissible_value=max_admissible_value, **conf_interval_params) @@ -436,8 +638,8 @@ def fit_ml_model( f"uncertainty method: {uncertainty_method} is not implemented") # We get the model summary for a subset of models - # where summary is available (statsmodels module), - # or summary can be constructed (a subset of models from sklearn). + # where summary is available (`statsmodels` module), + # or summary can be constructed (a subset of models from `sklearn`). 
ml_model_summary = None if "statsmodels" in fit_algorithm: ml_model_summary = ml_model.summary() @@ -461,7 +663,12 @@ def fit_ml_model( "min_admissible_value": min_admissible_value, "max_admissible_value": max_admissible_value, "normalize_df_func": normalize_df_func, - "regression_weight_col": regression_weight_col} + "regression_weight_col": regression_weight_col, + "drop_intercept_col": drop_intercept_col, + "alpha": alpha, + "h_mat": h_mat, + "p_effective": p_effective, + "sigma_scaler": sigma_scaler} if uncertainty_dict is None: fitted_df = predict_ml( @@ -501,6 +708,7 @@ def predict_ml( y_col = trained_model["y_col"] ml_model = trained_model["ml_model"] x_design_info = trained_model["x_design_info"] + drop_intercept_col = trained_model["drop_intercept_col"] min_admissible_value = trained_model["min_admissible_value"] max_admissible_value = trained_model["max_admissible_value"] @@ -510,6 +718,8 @@ def predict_ml( [x_design_info], data=fut_df, return_type="dataframe") + if drop_intercept_col is not None: + x_mat = x_mat.drop(columns=drop_intercept_col) if trained_model["normalize_df_func"] is not None: if "Intercept" in list(x_mat.columns): cols = [col for col in list(x_mat.columns) if col != "Intercept"] @@ -552,7 +762,7 @@ def predict_ml_with_uncertainty( - "x_mat": `patsy.design_info.DesignMatrix` Design matrix of the predictive model """ - # gets point predictions + # Gets point predictions. fut_df = fut_df.reset_index(drop=True) y_col = trained_model["y_col"] pred_res = predict_ml( @@ -564,10 +774,11 @@ def predict_ml_with_uncertainty( fut_df[y_col] = y_pred.tolist() - # apply uncertainty model + # Applies uncertainty model. 
pred_df_with_uncertainty = predict_ci( fut_df, - trained_model["uncertainty_model"]) + trained_model["uncertainty_model"], + x_mat=x_mat) return { "fut_df": pred_df_with_uncertainty, @@ -589,7 +800,8 @@ def fit_ml_model_with_evaluation( max_admissible_value=None, uncertainty_dict=None, normalize_method="zero_to_one", - regression_weight_col=None): + regression_weight_col=None, + remove_intercept=False): """Fits prediction models to continuous response vector (y) and report results. @@ -655,6 +867,15 @@ def fit_ml_model_with_evaluation( regression_weight_col : `str` or None, default None The column name for the weights to be used in weighted regression version of applicable machine-learning models. + remove_intercept : `bool`, default False + Whether to remove explicit and implicit intercepts. + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". + However, if there are categorical variables, even if we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + Sometimes we don't want this to happen. + Setting this parameter to True will remove both explicit and implicit intercepts.
Returns ------- @@ -712,7 +933,8 @@ def fit_ml_model_with_evaluation( max_admissible_value=max_admissible_value, uncertainty_dict=uncertainty_dict, normalize_method=normalize_method, - regression_weight_col=regression_weight_col) + regression_weight_col=regression_weight_col, + remove_intercept=remove_intercept) # we store the obtained ``y_col`` from the function in a new variable (``y_col_final``) # this is done since the input y_col could be None @@ -780,7 +1002,8 @@ def plt_pred(): max_admissible_value=max_admissible_value, uncertainty_dict=uncertainty_dict, normalize_method=normalize_method, - regression_weight_col=regression_weight_col) + regression_weight_col=regression_weight_col, + remove_intercept=remove_intercept) y_train_pred = predict_ml( fut_df=df_train, diff --git a/greykite/algo/common/partial_regularize_regression.py b/greykite/algo/common/partial_regularize_regression.py index 7e61549..407b557 100644 --- a/greykite/algo/common/partial_regularize_regression.py +++ b/greykite/algo/common/partial_regularize_regression.py @@ -535,9 +535,8 @@ def _fit_analytic(self, x, y, l2_alpha, cv): lasso_input = self._get_lasso_input(x_train, y_train, l2_alpha) if l1_alphas is not None: # Fits all `l1_alphas`. 
- path = LassoCV().path( + path = LassoCV(fit_intercept=False).path( alphas=l1_alphas, - fit_intercept=False, X=lasso_input["x_lasso"], y=lasso_input["y_lasso"] ) diff --git a/greykite/algo/forecast/silverkite/forecast_silverkite.py b/greykite/algo/forecast/silverkite/forecast_silverkite.py index 2cfb24e..97684fe 100644 --- a/greykite/algo/forecast/silverkite/forecast_silverkite.py +++ b/greykite/algo/forecast/silverkite/forecast_silverkite.py @@ -97,6 +97,8 @@ def forecast( fit_algorithm="linear", fit_algorithm_params=None, daily_event_df_dict=None, + daily_event_neighbor_impact=None, + daily_event_shifted_effect=None, fs_components_df=pd.DataFrame({ "name": [ TimeFeaturesEnum.tod.value, @@ -121,7 +123,8 @@ def forecast( forecast_horizon=None, simulation_based=False, simulation_num=10, - fast_simulation=False): + fast_simulation=False, + remove_intercept=False): """A function for forecasting. It captures growth, seasonality, holidays and other patterns. See "Capturing the time-dependence in the precipitation process for @@ -279,6 +282,37 @@ def forecast( Do not use `~greykite.common.constants.EVENT_DEFAULT` in the second column. This is reserved to indicate dates that do not correspond to an event. + daily_event_neighbor_impact : `int`, `list` [`int`], callable or None, default None + The impact of neighboring timestamps of the events in ``event_df_dict``. + This is for daily events so the units below are all in days. + + For example, if the data is weekly ("W-SUN") and an event is daily, + it may not exactly fall on the weekly date. + But you can specify for New Year's day on 1/1, it affects all dates + in the week, e.g. 12/31, 1/1, ..., 1/6, then it will be mapped to the weekly date. + In this case you may want to map a daily event's date to a few dates, + and can specify + ``neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)]``. 
+ + Another example is that the data is rolling 7 day daily data, + thus a holiday may affect the t, t+1, ..., t+6 dates. + You can specify ``neighbor_impact=7``. + + If input is `int`, the mapping is t, t+1, ..., t+neighbor_impact-1. + If input is `list`, the mapping is [t+x for x in neighbor_impact]. + If input is a function, it maps each daily event's date to a list of dates. + daily_event_shifted_effect : `list` [`str`] or None, default None + Additional neighbor events based on given events. + For example, passing ["-1D", "7D"] will add extra daily events which are 1 day before + and 7 days after the given events. + Offset format is {d}{freq} with any integer plus a frequency string. + Must be parsable by pandas ``to_offset``. + The new events' names will be the current events' names with suffix "{offset}_before" or "{offset}_after". + For example, if we have an event named "US_Christmas Day", + a "7D" shift will have name "US_Christmas Day_7D_after". + This is useful when you expect an offset of the current holidays also has impact on the + time series, or you want to interact the lagged terms with autoregression. + If ``daily_event_neighbor_impact`` is also specified, this will be applied after adding neighboring days. fs_components_df : `pandas.DataFrame` or None, optional A dataframe with information about fourier series generation. Must contain columns with following names: @@ -495,7 +529,15 @@ def forecast( without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. - + remove_intercept : `bool`, default False + Whether to remove explicit and implicit intercepts. + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". 
+ However, if there are categorical variables, even if we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + Sometimes we don't want this to happen. + Setting this parameter to True will remove both explicit and implicit intercepts. Returns ------- @@ -608,6 +650,8 @@ def forecast( The past dataframe used to generate AR terms. It includes the concatenation of ``past_df`` and ``df`` if ``past_df`` is provided, otherwise it is the ``df`` itself. + drop_intercept_col : `str` or None + The intercept column, explicit or implicit, to be dropped. """ df = df.copy() @@ -805,6 +849,8 @@ def forecast( time_col=time_col, origin_for_time_vars=origin_for_time_vars, daily_event_df_dict=daily_event_df_dict, + daily_event_neighbor_impact=daily_event_neighbor_impact, + daily_event_shifted_effect=daily_event_shifted_effect, changepoint_values=changepoint_values, continuous_time_col=continuous_time_col, growth_func=growth_func, @@ -966,7 +1012,8 @@ def forecast( max_admissible_value=max_admissible_value, uncertainty_dict=uncertainty_dict, normalize_method=normalize_method, - regression_weight_col=regression_weight_col) + regression_weight_col=regression_weight_col, + remove_intercept=remove_intercept) # Normalizes the changepoint_values normalized_changepoint_values = self.__normalize_changepoint_values( @@ -997,6 +1044,8 @@ def forecast( trained_model["lagged_regressor_cols"] = lagged_regressor_cols trained_model["normalize_method"] = normalize_method trained_model["daily_event_df_dict"] = daily_event_df_dict + trained_model["daily_event_neighbor_impact"] = daily_event_neighbor_impact + trained_model["daily_event_shifted_effect"] = daily_event_shifted_effect trained_model["changepoints_dict"] = changepoints_dict trained_model["changepoint_values"] = changepoint_values trained_model["normalized_changepoint_values"] = normalized_changepoint_values @@ -1136,6 +1185,8 @@ def predict_no_sim( 
time_col=trained_model["time_col"], origin_for_time_vars=trained_model["origin_for_time_vars"], daily_event_df_dict=trained_model["daily_event_df_dict"], + daily_event_neighbor_impact=trained_model["daily_event_neighbor_impact"], + daily_event_shifted_effect=trained_model["daily_event_shifted_effect"], changepoint_values=trained_model["changepoint_values"], continuous_time_col=trained_model["continuous_time_col"], growth_func=trained_model["growth_func"], @@ -1386,6 +1437,8 @@ def simulate( time_col=time_col, origin_for_time_vars=trained_model["origin_for_time_vars"], daily_event_df_dict=trained_model["daily_event_df_dict"], + daily_event_neighbor_impact=trained_model["daily_event_neighbor_impact"], + daily_event_shifted_effect=trained_model["daily_event_shifted_effect"], changepoint_values=trained_model["changepoint_values"], continuous_time_col=trained_model["continuous_time_col"], growth_func=trained_model["growth_func"], @@ -1441,9 +1494,20 @@ def simulate( f"However the std column ({ERR_STD_COL}) " "does not appear in the prediction") + # Here after assigning values for future forecast, we clip the values based on ``min_admissible_value`` and ``max_admissible_value`` + # saved in ``trained_model``. The clip should only function when error terms are added, as ``predict_no_sim`` ensures the predicted + # values (before errors are added) are bounded. 
+ min_admissible_value = trained_model["min_admissible_value"] + max_admissible_value = trained_model["max_admissible_value"] + if min_admissible_value is not None or max_admissible_value is not None: + fut_df_sim.at[i, value_col] = np.clip( + a=fut_df_sim.at[i, value_col], + a_min=min_admissible_value, + a_max=max_admissible_value) + # we get the last prediction value and concat that to the end of # ``past_df`` - past_df_increment = fut_df_sub[[value_col]] + past_df_increment = fut_df_sim.iloc[[i]].reset_index(drop=True)[[value_col]] assert len(past_df_increment) == 1 if past_df_sim is None: past_df_sim = past_df_increment @@ -1548,6 +1612,8 @@ def simulate_multi( time_col=trained_model["time_col"], origin_for_time_vars=trained_model["origin_for_time_vars"], daily_event_df_dict=trained_model["daily_event_df_dict"], + daily_event_neighbor_impact=trained_model["daily_event_neighbor_impact"], + daily_event_shifted_effect=trained_model["daily_event_shifted_effect"], changepoint_values=trained_model["changepoint_values"], continuous_time_col=trained_model["continuous_time_col"], growth_func=trained_model["growth_func"], @@ -1721,7 +1787,8 @@ def quantile_summary(x): return { "fut_df": agg_df, - "x_mat": x_mat} + "x_mat": x_mat, + "sim_res": sim_res} def predict_via_sim_fast( self, @@ -2607,7 +2674,10 @@ def partition_fut_df( ignore_index=True, sort=False) # Imputes the missing values - fut_df_expanded = na_fill_func(fut_df_expanded) + # Excludes time column which doesn't need imputation, + # otherwise it causes error with pandas>=1.4. 
+ fut_df_expanded.loc[:, fut_df_expanded.columns != time_col] = na_fill_func( + fut_df_expanded.loc[:, fut_df_expanded.columns != time_col]) index = ( [False] * fut_df_within_training.shape[0] + [True] * fut_df_gap.shape[0] + @@ -2661,6 +2731,8 @@ def __build_silverkite_features( time_col, origin_for_time_vars, daily_event_df_dict=None, + daily_event_neighbor_impact=None, + daily_event_shifted_effect=None, changepoint_values=None, continuous_time_col=None, growth_func=None, @@ -2756,6 +2828,37 @@ def __build_silverkite_features( Note: Do not use `~greykite.common.constants.EVENT_DEFAULT` in the second column. This is reserved to indicate dates that do not correspond to an event. + daily_event_neighbor_impact : `int`, `list` [`int`], callable or None, default None + The impact of neighboring timestamps of the events in ``event_df_dict``. + This is for daily events so the units below are all in days. + + For example, if the data is weekly ("W-SUN") and an event is daily, + it may not exactly fall on the weekly date. + But you can specify for New Year's day on 1/1, it affects all dates + in the week, e.g. 12/31, 1/1, ..., 1/6, then it will be mapped to the weekly date. + In this case you may want to map a daily event's date to a few dates, + and can specify + ``neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)]``. + + Another example is that the data is rolling 7 day daily data, + thus a holiday may affect the t, t+1, ..., t+6 dates. + You can specify ``neighbor_impact=7``. + + If input is `int`, the mapping is t, t+1, ..., t+neighbor_impact-1. + If input is `list`, the mapping is [t+x for x in neighbor_impact]. + If input is a function, it maps each daily event's date to a list of dates. + daily_event_shifted_effect : `list` [`str`] or None, default None + Additional neighbor events based on given events. 
+ For example, passing ["-1D", "7D"] will add extra daily events which are 1 day before + and 7 days after the given events. + Offset format is {d}{freq} with any integer plus a frequency string. + Must be parsable by pandas ``to_offset``. + The new events' names will be the current events' names with suffix "{offset}_before" or "{offset}_after". + For example, if we have an event named "US_Christmas Day", + a "7D" shift will have name "US_Christmas Day_7D_after". + This is useful when you expect an offset of the current holidays also has impact on the + time series, or you want to interact the lagged terms with autoregression. + If ``daily_event_neighbor_impact`` is also specified, this will be applied after adding neighboring days. changepoint_values : `list` of Union[int, float, double]], optional The values of the growth term at the changepoints Can be generated by the ``get_evenly_spaced_changepoints``, @@ -2805,7 +2908,9 @@ def __build_silverkite_features( features_df = add_daily_events( df=features_df, event_df_dict=daily_event_df_dict, - date_col="date") + date_col="date", + neighbor_impact=daily_event_neighbor_impact, + shifted_effect=daily_event_shifted_effect) # adds changepoints if changepoint_values is not None: diff --git a/greykite/algo/forecast/silverkite/forecast_simple_silverkite.py b/greykite/algo/forecast/silverkite/forecast_simple_silverkite.py index 415c937..e1be02b 100644 --- a/greykite/algo/forecast/silverkite/forecast_simple_silverkite.py +++ b/greykite/algo/forecast/silverkite/forecast_simple_silverkite.py @@ -90,6 +90,8 @@ def convert_params( holiday_post_num_days: int = 2, holiday_pre_post_num_dict: Optional[Dict] = None, daily_event_df_dict: Optional[Dict] = None, + daily_event_neighbor_impact: Optional[Union[int, List[int], callable]] = None, + daily_event_shifted_effect: Optional[List[str]] = None, auto_growth: bool = False, changepoints_dict: Optional[Dict] = None, auto_seasonality: bool = False, @@ -117,7 +119,8 @@ def 
convert_params( regression_weight_col: Optional[str] = None, simulation_based: Optional[bool] = False, simulation_num: int = 10, - fast_simulation: bool = False): + fast_simulation: bool = False, + remove_intercept: bool = False): """Converts parameters of :func:`~greykite.algo.forecast.silverkite.forecast_simple_silverkite` into those of :func:`~greykite.algo.forecast.forecast_silverkite.SilverkiteForecast::forecast`. @@ -352,6 +355,37 @@ def convert_params( Note: Do not use `~greykite.common.constants.EVENT_DEFAULT` in the second column. This is reserved to indicate dates that do not correspond to an event. + daily_event_neighbor_impact : `int`, `list` [`int`], callable or None, default None + The impact of neighboring timestamps of the events in ``event_df_dict``. + This is for daily events so the units below are all in days. + + For example, if the data is weekly ("W-SUN") and an event is daily, + it may not exactly fall on the weekly date. + But you can specify for New Year's day on 1/1, it affects all dates + in the week, e.g. 12/31, 1/1, ..., 1/6, then it will be mapped to the weekly date. + In this case you may want to map a daily event's date to a few dates, + and can specify + ``neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)]``. + + Another example is that the data is rolling 7 day daily data, + thus a holiday may affect the t, t+1, ..., t+6 dates. + You can specify ``neighbor_impact=7``. + + If input is `int`, the mapping is t, t+1, ..., t+neighbor_impact-1. + If input is `list`, the mapping is [t+x for x in neighbor_impact]. + If input is a function, it maps each daily event's date to a list of dates. + daily_event_shifted_effect : `list` [`str`] or None, default None + Additional neighbor events based on given events. + For example, passing ["-1D", "7D"] will add extra daily events which are 1 day before + and 7 days after the given events. 
+ Offset format is {d}{freq} with any integer plus a frequency string. + Must be parsable by pandas ``to_offset``. + The new events' names will be the current events' names with suffix "{offset}_before" or "{offset}_after". + For example, if we have an event named "US_Christmas Day", + a "7D" shift will have name "US_Christmas Day_7D_after". + This is useful when you expect an offset of the current holidays also has impact on the + time series, or you want to interact the lagged terms with autoregression. + If ``daily_event_neighbor_impact`` is also specified, this will be applied after adding neighboring days. auto_growth : `bool`, default False Whether to automatically infer growth configuration. If True, the growth term and automatically changepoint detection configuration @@ -609,6 +643,15 @@ def convert_params( without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. + remove_intercept : `bool`, default False + Whether to remove explicit and implicit intercepts. + By default, `patsy` will make the design matrix always full rank. + It will always include an intercept term unless we specify "-1" or "+0". + However, if there are categorical variables, even if we specify "-1" or "+0", + it will include an implicit intercept by adding all levels of a categorical + variable into the design matrix. + Sometimes we don't want this to happen. + Setting this parameter to True will remove both explicit and implicit intercepts. Returns @@ -720,7 +763,10 @@ def convert_params( # Sets empty dictionary to None daily_event_df_dict = None - extra_pred_cols += get_event_pred_cols(daily_event_df_dict) + extra_pred_cols += get_event_pred_cols( + daily_event_df_dict, + daily_event_shifted_effect + ) # Specifies ``extra_pred_cols`` (interactions and additional model terms). 
# Seasonality interaction order is limited by the available order and max requested. @@ -787,6 +833,8 @@ def convert_params( fit_algorithm=fit_algorithm, # pass-through fit_algorithm_params=fit_algorithm_params, # pass-through daily_event_df_dict=daily_event_df_dict, + daily_event_neighbor_impact=daily_event_neighbor_impact, # pass-through + daily_event_shifted_effect=daily_event_shifted_effect, # pass-through fs_components_df=fs_components_df, autoreg_dict=autoreg_dict, # pass-through past_df=past_df, # pass-through @@ -802,7 +850,8 @@ def convert_params( forecast_horizon=forecast_horizon, # pass-through simulation_based=simulation_based, # pass-through simulation_num=simulation_num, # pass-through - fast_simulation=fast_simulation # pass-through + fast_simulation=fast_simulation, # pass-through + remove_intercept=remove_intercept # pass-through ) return parameters @@ -845,13 +894,13 @@ def __get_requested_seasonality_order( Parameters ---------- - requested_seasonality : `str` or `bool` or `int`, default = 'auto' + requested_seasonality : `str` or `bool` or `int` or None, default = "auto" The requested seasonality. - 'auto', True, False, or a number for the Fourier order. + "auto", `True`, `False`, None (same as `False`) or a number for the Fourier order. default_order : `int` - The default order to use for 'auto' and True. + The default order to use for "auto" and True. is_enabled_auto : `bool` - Whether the seasonality should be modeled for 'auto' seasonality. + Whether the seasonality should be modeled for "auto" seasonality. 
Returns ------- @@ -862,12 +911,14 @@ def __get_requested_seasonality_order( order = default_order elif requested_seasonality is False or (requested_seasonality == 'auto' and not is_enabled_auto): order = 0 + elif requested_seasonality is None: + order = 0 else: try: order = int(requested_seasonality) except ValueError as e: log_message(f"Requested seasonality order '{requested_seasonality}' must be one of:" - f" 'auto', True, False, integer", LoggingLevelEnum.ERROR) + f" 'auto', True, False, None, or int.", LoggingLevelEnum.ERROR) raise e return order diff --git a/greykite/algo/forecast/silverkite/forecast_simple_silverkite_helper.py b/greykite/algo/forecast/silverkite/forecast_simple_silverkite_helper.py index a40b591..1521b14 100644 --- a/greykite/algo/forecast/silverkite/forecast_simple_silverkite_helper.py +++ b/greykite/algo/forecast/silverkite/forecast_simple_silverkite_helper.py @@ -34,6 +34,7 @@ from greykite.common.features.timeseries_features import add_event_window_multi from greykite.common.features.timeseries_features import get_fourier_col_name from greykite.common.features.timeseries_features import get_holidays +from greykite.common.python_utils import split_offset_str def cols_interact( @@ -264,7 +265,9 @@ def patsy_categorical_term( return string -def get_event_pred_cols(daily_event_df_dict): +def get_event_pred_cols( + daily_event_df_dict, + daily_event_shifted_effect=None): """Generates the names of internal predictor columns from the event dictionary passed to `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast.forecast`. @@ -285,6 +288,19 @@ def get_event_pred_cols(daily_event_df_dict): daily_event_df_dict : `dict` or None, optional, default None A dictionary of data frames, each representing events data for the corresponding key. See `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast.forecast`. 
+ daily_event_shifted_effect : `list` [`str`] or None, default None + Additional neighbor events based on given events. + For example, passing ["-1D", "7D"] will add extra daily events which are 1 day before + and 7 days after the given events. + Offset format is {d}{freq} with any integer plus a frequency string. + Must be parsable by pandas ``to_offset``. + The new events' names will be the current events' names with suffix "{offset}_before" or "{offset}_after". + For example, if we have an event named "US_Christmas Day", + a "7D" shift will have name "US_Christmas Day_7D_after". + This is useful when you expect an offset of the current holidays also has impact on the + time series, or you want to interact the lagged terms with autoregression. + The interaction can be specified with e.g. ``y_lag7:events_US_Christmas Day_7D_after``. + If ``daily_event_neighbor_impact`` is also specified, this will be applied after adding neighboring days. Returns ------- @@ -302,4 +318,17 @@ def get_event_pred_cols(daily_event_df_dict): event_levels = [cst.EVENT_DEFAULT] # reference level for non-event days event_levels += list(daily_event_df_dict[key][cst.EVENT_DF_LABEL_COL].unique()) # this event's levels event_pred_cols += [patsy_categorical_term(term=term, levels=event_levels)] + # Adds columns for additional neighbor events. + # Does the above for each additional lagged event. 
+ if daily_event_shifted_effect is not None: + for lag in daily_event_shifted_effect: + num, freq = split_offset_str(lag) + num = int(num) + suffix = cst.EVENT_SHIFTED_SUFFIX_BEFORE if num < 0 else cst.EVENT_SHIFTED_SUFFIX_AFTER + term = f"{cst.EVENT_PREFIX}_{key}_{abs(num)}{freq}{suffix}" + event_levels = [cst.EVENT_DEFAULT] + levels = list(daily_event_df_dict[key][cst.EVENT_DF_LABEL_COL].unique()) + levels = [f"{level}_{abs(num)}{freq}{suffix}" for level in levels] + event_levels += levels + event_pred_cols += [patsy_categorical_term(term=term, levels=event_levels)] return event_pred_cols diff --git a/greykite/algo/uncertainty/conditional/conf_interval.py b/greykite/algo/uncertainty/conditional/conf_interval.py index 1d4c56d..4235c24 100644 --- a/greykite/algo/uncertainty/conditional/conf_interval.py +++ b/greykite/algo/uncertainty/conditional/conf_interval.py @@ -18,7 +18,7 @@ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# original author: Reza Hosseini, Sayan Patra +# original author: Reza Hosseini, Sayan Patra, Yi Su """Calculates uncertainty intervals from the conditional empirical distribution of the residual. """ @@ -27,6 +27,7 @@ import numpy as np import pandas as pd +import scipy from greykite.algo.uncertainty.conditional.dataframe_utils import limit_tuple_col from greykite.algo.uncertainty.conditional.dataframe_utils import offset_tuple_col @@ -40,6 +41,9 @@ def conf_interval( df, distribution_col, offset_col=None, + sigma_scaler=None, + h_mat=None, + x_mean=None, conditional_cols=None, quantiles=(0.005, 0.025, 0.975, 0.995), quantile_estimation_method="normal_fit", @@ -79,6 +83,22 @@ def conf_interval( The column containing the values by which the computed quantiles for ``distribution_col`` are shifted. Only used during prediction phase. If None, quantiles are not shifted. 
+ sigma_scaler : `float` or None, default None + Scaling factor that is applied to the estimated standard deviation ``sigma`` in regression setting. + Used to take into account the degrees of freedom in the fitted model, otherwise + `sigma` is under-estimated by just using the distribution of the residuals. + The formula is ``sigma_scaler = np.sqrt((n_train - 1) / (n_train - p_effective))``. + Only useful in linear and ridge regression models. + If `None`, no scaling will be done. + h_mat : `np.ndarray` or None, default None + The H matrix ``np.linalg.pinv(X.T @ X + alpha * np.eye(p)) @ X.T`` in regression setting. + Dimension is ``p`` (number of parameters) by ``n_train``, and + ``alpha`` is the regularization term extracted from ``ml_model``. + See `~greykite.algo.common.ml_models.fit_ml_model` for details. + x_mean : `np.ndarray` or None, default None + Column mean of ``x_mat`` as a row vector. + This is stored and used in ridge regression to compute the prediction intervals. + In other methods, it is set to `None`. conditional_cols : `list` [`str`] or None, default None These columns are used to slice the data first then calculate quantiles for each slice. 
@@ -168,8 +188,7 @@ def conf_interval( quantile_grid_size=None, quantiles=quantiles, conditional_cols=conditional_cols, - remove_conditional_mean=True - ) + remove_conditional_mean=True) ecdf_df = model_dict["ecdf_df"] ecdf_df_overall = model_dict["ecdf_df_overall"] @@ -184,25 +203,23 @@ def conf_interval( mean_col=None, fixed_mean=0.0, quantiles=quantiles, - quantile_summary_col=quantile_summary_col - ) + quantile_summary_col=quantile_summary_col) ecdf_df_fallback = normal_quantiles_df( df=ecdf_df_overall, std_col=std_col, mean_col=None, fixed_mean=0.0, quantiles=quantiles, - quantile_summary_col=quantile_summary_col - ) + quantile_summary_col=quantile_summary_col) else: raise NotImplementedError( f"CI calculation method {quantile_estimation_method} is not either of: normal_fit; ecdf") - # handling slices with small sample size - # if a method is provided via the argument "small_sample_size_method" then it is used here - # the idea is to take a relatively high volatility - # when the new point does not have enough (as specified by "sample_size_thresh") - # similar points in the past + # Handles slices with small sample size. + # If a method is provided via the argument `small_sample_size_method`, then it is used here. + # The idea is to take a relatively high volatility + # when the new point does not have enough (as specified by `sample_size_thresh`) + # similar points in the past. 
fall_back_for_all = False if small_sample_size_method == "std_quantiles": ecdf_df_large_ss = ecdf_df.loc[ecdf_df[sample_size_col] >= sample_size_thresh].reset_index(drop=True) @@ -222,7 +239,8 @@ def conf_interval( ecdf_df_large_ss["std_quantile_diff"] = abs(ecdf_df_large_ss["std_quantile"] - small_sample_size_quantile) # Chooses the row with closes value in "std_quantile" column to ``small_sample_size_quantile`` # Note the resulting dataframe below ``ecdf_df_fallback`` will have one row - ecdf_df_fallback = ecdf_df_large_ss.loc[[ecdf_df_large_ss["std_quantile_diff"].idxmin()]] + ecdf_df_fallback = (ecdf_df_large_ss.loc[[ecdf_df_large_ss["std_quantile_diff"].idxmin()]] + .reset_index(drop=True)) del ecdf_df_fallback["std_quantile"] del ecdf_df_fallback["std_quantile_diff"] del ecdf_df_large_ss["std_quantile"] @@ -235,7 +253,35 @@ def conf_interval( raise NotImplementedError( f"small_sample_size_method {small_sample_size_method} is not implemented.") - return { + # Pre-calculates the quantities needed in `predict_ci`. + lu_d_sqrt = None + """The L matrix (p by p) decomposed from `H @ H.T`, where `H = inv(X.T @ X + alpha * np.eye(p)) @ X.T`. + We decompose `H @ H.T` into `L @ L.T` where L is a square p by p matrix. + This matrix is pre-calculated and stored in the trained model for fast inference in a later step. + Cholesky decomposition does not apply because `H @ H.T` is not necessarily full rank, albeit positive semi-definite. + We could use eigenvalue decomposition (`numpy.linalg.eigh`) or the LDL decomposition (`scipy.linalg.ldl`) + of a Hermitian matrix for this purpose. + """ + n_train = None + if h_mat is not None: + n_train = h_mat.shape[1] + # If `h_mat` (H) is obtained successfully, computes `lu_d_sqrt` (L, p x p) + # s.t. `h_mat @ h_mat.T = lu_d_sqrt @ lu_d_sqrt.T`. + hht_mat = h_mat @ h_mat.T + lu, d, _ = scipy.linalg.ldl(hht_mat) # LDL decomposition. + d[d < 0] = 0 # Due to floating precision issue, there could be near-zero negative eigenvalues. 
+ lu_d_sqrt = lu @ (d ** 0.5) + # Asserts a check for the approximation. If it fails, falls back to the original `h_mat`. + tolerance = 1e-8 + relative_err = np.linalg.norm(lu_d_sqrt @ lu_d_sqrt.T - hht_mat) / np.linalg.norm(hht_mat) + # Note that if adding `assert relative_err < 1e-12`, all unit tests still passed. + if np.isnan(lu_d_sqrt).any() or relative_err > tolerance: + warnings.warn(f"Re-constructing `h_mat @ h_mat.T` by `lu_d_sqrt @ lu_d_sqrt.T` has a bigger relative error " + f"{relative_err} than tolerance {tolerance}. Falling back to `h_mat` for more " + f"accurate variance estimation.") + lu_d_sqrt = h_mat + + uncertainty_model = { "ecdf_df": ecdf_df, "ecdf_df_overall": ecdf_df_overall, "ecdf_df_fallback": ecdf_df_fallback, @@ -247,12 +293,41 @@ def conf_interval( "conditional_cols": conditional_cols, "std_col": std_col, "quantile_summary_col": quantile_summary_col, - "fall_back_for_all": fall_back_for_all} + "fall_back_for_all": fall_back_for_all, + "sigma_scaler": sigma_scaler, + "lu_d_sqrt": lu_d_sqrt, + "n_train": n_train, + "x_train_mean": x_mean} + + # Scales `std_col` and `quantile_summary_col` columns in returned quantile dataframes: + # (1) `ecdf_df` (2) `ecdf_df_fallback`. + def scale_std_quantile_summary_inplace(df, sigma_scaler): + """Scales ``std_col`` and ``quantile_summary_col`` of ``df`` by ``sigma_scaler`` in-place.""" + if sigma_scaler is None: + sigma_scaler = 1 + df[std_col] *= sigma_scaler + # Since values in `quantile_summary_col` are tuples, we need to convert it to `np.array` first. 
+ quantile_summary = df[quantile_summary_col].apply(lambda value_tuple: np.array(value_tuple)) + quantile_summary *= sigma_scaler + df[quantile_summary_col] = quantile_summary.apply(lambda array: tuple(array)) + + if sigma_scaler is not None: + ecdf_df_original = ecdf_df.copy() + scale_std_quantile_summary_inplace(ecdf_df, sigma_scaler) + ecdf_df_fallback_original = ecdf_df_fallback.copy() + scale_std_quantile_summary_inplace(ecdf_df_fallback, sigma_scaler) + # Adds the original dataframes to `uncertainty_model`. + uncertainty_model.update({ + "ecdf_df_original": ecdf_df_original, + "ecdf_df_fallback_original": ecdf_df_fallback_original}) + + return uncertainty_model def predict_ci( new_df, - ci_model): + ci_model, + x_mat=None): """Predicts the quantiles of the ``offset_col`` (defined in ``ci_model``) in ``new_df``. Parameters @@ -262,6 +337,10 @@ def predict_ci( and ``conditional_cols`` defined in the ``ci_model``. ci_model : `dict` Returned CI model from ``conf_interval``. + x_mat : `np.ndarray` or None, default None + The design matrix from the model fitted on ``new_df``. + ``x_mat.shape[0]`` must match that of ``new_df``, and + ``x_mat.shape[1]`` must match the number of features in ``ml_model`` where ``ci_model`` is fitted. Returns ------- @@ -279,6 +358,47 @@ def predict_ci( std_col = ci_model["std_col"] quantile_summary_col = ci_model["quantile_summary_col"] fall_back_for_all = ci_model["fall_back_for_all"] + lu_d_sqrt = ci_model["lu_d_sqrt"] + n_train = ci_model["n_train"] + x_mean = ci_model["x_train_mean"] + + # Computes the scaling factor for prediction interval standard errors. + # We will use `pi_se_scaler` to scale `quantile_summary_col` and `std_col` columns. + pi_se_scaler = 1 + # If `x_mat` is provided in case of linear or ridge regression, + # we know the closed form expression of the prediction intervals. + # Therefore, we need to scale the prediction intervals, regardless of train / test. 
+ if x_mat is not None and lu_d_sqrt is not None: + assert new_df.shape[0] == x_mat.shape[0], "In `predict_ci`, `new_df` needs to be the same length as `x_mat`." + # `h_mat @ h_mat.T = lu_d_sqrt @ lu_d_sqrt.T`, where `lu_d_sqrt` is p by p and is pre-calculated. + assert lu_d_sqrt.shape[0] == x_mat.shape[1], "Feature dimension p must match `lu_d_sqrt` and `x_mat`." + n_pred, p = x_mat.shape + X_pred = np.array(x_mat).reshape((n_pred, p)) + # Variance from coefficients contains 3 terms. + # The first term is true for both linear and ridge regression, while the other two terms are for ridge only. + # Let's use `L @ L.T` to replace `H @ H.T`, + # and denote `x_mean_mat = np.repeat(x_mean, repeats=n_pred, axis=0)` (`n_pred` by `p`), + # then we have the following formulas. + # Term 1: `X_pred @ L @ L.T @ X_pred.T`. + # Term 2: `x_mean_mat @ L @ L.T @ X_pred.T` plus its transpose. + # Term 3: `(1 / n_train + x_mean @ L @ L.T @ x_mean.T) * np.ones((n_pred, n_pred))`. + # (1) In the ridge case, the three terms can be simplified to a quadratic form plus a constant matrix: + # `(X_pred - x_mean_mat) @ L @ L.T @ (X_pred - x_mean_mat).T + (1 / n_train) * np.ones((n_pred, n_pred))`. + # (2) In the linear case, we only keep term 1: `X_pred @ L @ L.T @ X_pred.T`. + # Since we only need the diagonal elements, they can be obtained as follows. + # Define `A = L.T @ X_pred.T` (`p` by `n_pred`), then the diagonal of `A.T @ A` is equivalent + # to `(A ** 2).sum(axis=0)`, length `n_pred`, which is much more efficient to compute. + if x_mean is not None: # Ridge. + # In case of ridge regression, we use `x_mean` to center `X_pred` first. + X_pred = X_pred - x_mean + A = lu_d_sqrt.T @ X_pred.T # `p` by `n_pred`. + # We are ready to compute the variance scaler. + # The last term of 1 below is from `np.ones((n_pred,))`, assuming i.i.d. errors with constant variance. + if x_mean is not None: # Ridge. + pi_variance = (A ** 2).sum(axis=0) + (1 / n_train + 1) + else: # Linear. 
+ pi_variance = (A ** 2).sum(axis=0) + 1 + pi_se_scaler = np.sqrt(pi_variance) # Copies ``pred_df`` so that input df to predict is not altered pred_df = new_df.reset_index(drop=True) @@ -303,14 +423,23 @@ def predict_ci( how="left") # When we have missing in the grouped case (which can happen if a level - # in ``match_cols`` didn't appear in train dataset) - # we fall back to the overall case + # in ``conditional_cols`` didn't appear in train dataset), + # we fall back to the overall case. for col in [quantile_summary_col, std_col]: na_index = pred_df_conditional[col].isnull() pred_df_conditional.loc[na_index, col] = ( pred_df_fallback.loc[na_index, col]) - # offsetting the values in ``distribution_col`` by ``offset_col`` + # Before offsetting, applies `pi_se_scaler` to all rows in `quantile_summary_col` and `std_col` columns. + # This scaling only happens in uncertainty model's predict phase (`predict_ci`). + pred_df_conditional[std_col] *= pi_se_scaler + # Since values in `quantile_summary_col` are tuples, we need to convert it to `np.array` first. + quantile_summary = pred_df_conditional[quantile_summary_col].apply(lambda value_tuple: np.array(value_tuple)) + quantile_summary *= pi_se_scaler + pred_df_conditional[quantile_summary_col] = quantile_summary.apply(lambda array: tuple(array)) + ci_model["pi_se_scaler"] = pi_se_scaler + + # Offsets the values in `distribution_col` by `offset_col`. if offset_col is None: pred_df_conditional[QUANTILE_SUMMARY_COL] = pred_df_conditional[quantile_summary_col] else: @@ -325,12 +454,13 @@ def predict_ci( lower=min_admissible_value, upper=max_admissible_value) - # Only returning needed cols + # Only returns needed cols. returned_cols = [QUANTILE_SUMMARY_COL, std_col] if conditional_cols is not None: returned_cols = conditional_cols + returned_cols pred_df[returned_cols] = pred_df_conditional[returned_cols] + # Standardizes `std_col` column name. 
pred_df.rename(columns={ std_col: ERR_STD_COL }, inplace=True) diff --git a/greykite/common/aggregation_function_enum.py b/greykite/common/aggregation_function_enum.py index cc8f717..818df1f 100644 --- a/greykite/common/aggregation_function_enum.py +++ b/greykite/common/aggregation_function_enum.py @@ -38,4 +38,5 @@ class AggregationFunctionEnum(Enum): nanmean = partial(np.nanmean) maximum = partial(np.max) minimum = partial(np.min) + sum = partial(np.sum) weighted_average = partial(np.average) diff --git a/greykite/common/constants.py b/greykite/common/constants.py index a2fc26c..2ba94a9 100644 --- a/greykite/common/constants.py +++ b/greykite/common/constants.py @@ -30,76 +30,92 @@ # The time series data is represented in pandas dataframes # The default column names for the series are given below TIME_COL = "ts" -"""The default name for the column with the timestamps of the time series""" +"""The default name for the column with the timestamps of the time series.""" VALUE_COL = "y" -"""The default name for the column with the values of the time series""" +"""The default name for the column with the values of the time series.""" ACTUAL_COL = "actual" -"""The column name representing actual (observed) values""" +"""The column name representing actual (observed) values.""" PREDICTED_COL = "forecast" -"""The column name representing the predicted values""" +"""The column name representing the predicted values.""" RESIDUAL_COL = "residual" """The column name representing the forecast residuals.""" PREDICTED_LOWER_COL = "forecast_lower" -"""The column name representing upper bounds of prediction interval""" +"""The column name representing lower bounds of prediction interval.""" PREDICTED_UPPER_COL = "forecast_upper" -"""The column name representing lower bounds of prediction interval""" +"""The column name representing upper bounds of prediction interval.""" NULL_PREDICTED_COL = "forecast_null" -"""The column name representing predicted values from null model""" 
+"""The column name representing predicted values from null model.""" ERR_STD_COL = "err_std" -"""The column name representing the error standard deviation from models""" +"""The column name representing the error standard deviation from models.""" QUANTILE_SUMMARY_COL = "quantile_summary" -"""The column name representing the quantile summary from models""" +"""The column name representing the quantile summary from models.""" -# Evaluation metrics corresponding to `~greykite.common.evaluation` +# Evaluation metrics corresponding to `~greykite.common.evaluation`. R2_null_model_score = "R2_null_model_score" """Evaluation metric. Improvement in the specified loss function compared to the predictions of a null model.""" FRACTION_OUTSIDE_TOLERANCE = "Outside Tolerance (fraction)" -"""Evaluation metric. The fraction of predictions outside the specified tolerance level""" +"""Evaluation metric. The fraction of predictions outside the specified tolerance level.""" PREDICTION_BAND_WIDTH = "Prediction Band Width (%)" -"""Evaluation metric. Relative size of prediction bands vs actual, as a percent""" +"""Evaluation metric. Relative size of prediction bands vs actual, as a percent.""" PREDICTION_BAND_COVERAGE = "Prediction Band Coverage (fraction)" -"""Evaluation metric. Fraction of observations within the bands""" +"""Evaluation metric. Fraction of observations within the bands.""" LOWER_BAND_COVERAGE = "Coverage: Lower Band" -"""Evaluation metric. Fraction of observations within the lower band""" +"""Evaluation metric. Fraction of observations within the lower band.""" UPPER_BAND_COVERAGE = "Coverage: Upper Band" -"""Evaluation metric. Fraction of observations within the upper band""" +"""Evaluation metric. Fraction of observations within the upper band.""" COVERAGE_VS_INTENDED_DIFF = "Coverage Diff: Actual_Coverage - Intended_Coverage" -"""Evaluation metric. Difference between actual and intended coverage""" +"""Evaluation metric. 
Difference between actual and intended coverage.""" -# Column names used by `~greykite.common.features.timeseries_features` +# Column names used by `~greykite.common.features.timeseries_features`. EVENT_DF_DATE_COL = "date" -"""Name of date column for the DataFrames passed to silverkite `custom_daily_event_df_dict`""" +"""Name of date column for the DataFrames passed to silverkite `custom_daily_event_df_dict`.""" EVENT_DF_LABEL_COL = "event_name" -"""Name of event column for the DataFrames passed to silverkite `custom_daily_event_df_dict`""" +"""Name of event column for the DataFrames passed to silverkite `custom_daily_event_df_dict`.""" EVENT_PREFIX = "events" """Prefix for naming event features.""" EVENT_DEFAULT = "" """Label used for days without an event.""" EVENT_INDICATOR = "event" -"""Binary indicatory for an event""" +"""Binary indicator for an event.""" +IS_EVENT_COL = "is_event" +"""Indicator column in feature matrix, 1 if the day is an event or its neighboring days.""" +IS_EVENT_ADJACENT_COL = "is_event_adjacent" +"""Indicator column in feature matrix, 1 if the day is adjacent to an event.""" +IS_EVENT_EXACT_COL = "is_event_exact" +"""Indicator column in feature matrix, 1 if the day is an event but not its neighboring days.""" +EVENT_SHIFTED_SUFFIX_BEFORE = "_before" +"""The suffix for neighboring events before the events added to the event names.""" +EVENT_SHIFTED_SUFFIX_AFTER = "_after" +"""The suffix for neighboring events after the events added to the event names.""" CHANGEPOINT_COL_PREFIX = "changepoint" """Prefix for naming changepoint features.""" CHANGEPOINT_COL_PREFIX_SHORT = "cp" """Short prefix for naming changepoint features.""" # Column names used by -# `~greykite.common.features.adjust_anomalous_data.adjust_anomalous_data` +# `~greykite.common.features.adjust_anomalous_data.adjust_anomalous_data`. 
START_TIME_COL = "start_time" -"""Start timestamp column name""" +"""Default column name for anomaly start time in the anomaly dataframe.""" END_TIME_COL = "end_time" -"""Standard end timestamp column""" +"""Default column name for anomaly end time in the anomaly dataframe.""" ADJUSTMENT_DELTA_COL = "adjustment_delta" -"""Adjustment column""" +"""Default column name for anomaly adjustment in the anomaly dataframe.""" METRIC_COL = "metric" -"""Column to denote metric of interest""" +"""Column to denote metric of interest.""" DIMENSION_COL = "dimension" -"""Dimension column""" +"""Dimension column.""" ANOMALY_COL = "is_anomaly" -"""The default name for the column with the anomaly labels of the time series""" +"""Default column name for anomaly labels (boolean) in the time series.""" +PREDICTED_ANOMALY_COL = "is_anomaly_predicted" +"""Default column name for predicted anomaly labels (boolean) in the time series.""" - -# Constants related to -# `~greykite.common.features.timeseries_features.build_time_features_df`. +# Column names used in anomaly dataframe during anomaly detection. 
+SEVERITY_SCORE_COL = "severity_score" +"""Default column name for anomaly severity score in the anomaly dataframe.""" +USER_REVIEWED_COL = "is_user_reviewed" +"""Default column name for whether an anomaly is reviewed by the user (boolean) in the anomaly dataframe.""" +NEW_PATTERN_ANOMALY_COL = "new_pattern_anomaly" +"""Default column name for whether an anomaly is a new pattern (boolean) in the anomaly dataframe.""" class TimeFeaturesEnum(Enum): @@ -158,6 +174,8 @@ class TimeFeaturesEnum(Enum): ct3 = "ct3" ct_sqrt = "ct_sqrt" ct_root3 = "ct_root3" + us_dst = "us_dst" + eu_dst = "eu_dst" class GrowthColEnum(Enum): @@ -177,9 +195,9 @@ class GrowthColEnum(Enum): # Column names used by # `~greykite.common.features.timeseries_lags` LAG_INFIX = "_lag" -"""Infix for lagged feature names""" +"""Infix for lagged feature names.""" AGG_LAG_INFIX = "avglag" -"""Infix for aggregated lag feature names""" +"""Infix for aggregated lag feature names.""" # Patterns for categorizing timeseries features TREND_REGEX = f"{CHANGEPOINT_COL_PREFIX}\\d|ct\\d|ct_|{CHANGEPOINT_COL_PREFIX_SHORT}\\d" @@ -193,3 +211,25 @@ class GrowthColEnum(Enum): LOGGER_NAME = "Greykite" """Name used by the logger.""" + +# Default regex dictionary for component plots +DEFAULT_COMPONENTS_REGEX_DICT = { + "Regressors": ".*regressor.|regressor", + "Autoregressive": ".*_lag.|.*avglag.", + "Event": f".*{EVENT_REGEX}.", + "Seasonality": f".*tod.|.*tow.|.*dow.|.*is_weekend.|.*tom.|.*month.|.*toq.|.*quarter.|.*toy.|.*year.|.*yearly", + "Trend": TREND_REGEX, +} + +# Detailed seasonality regex dictionary for component plots +DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT = { + "Regressors": ".*regressor.|regressor", + "Autoregressive": ".*_lag.|.*avglag.", + "Event": f".*{EVENT_REGEX}.", + "Daily": f".*tod.", + "Weekly": f".*tow.|.*dow.|.*is_weekend.", + "Monthly": f".*tom.|.*month.", + "Quarterly": f".*toq.|.*quarter.", + "Yearly": f".*toy.|.*year.|.*yearly", + "Trend": TREND_REGEX, +} diff --git 
a/greykite/common/data_loader.py b/greykite/common/data_loader.py index f563cbf..03ba695 100644 --- a/greykite/common/data_loader.py +++ b/greykite/common/data_loader.py @@ -106,7 +106,7 @@ def get_aggregated_data(df, agg_freq=None, agg_func=None): agg_freq : `str` or None, default None If None, data will not be aggregated and will include all columns. - Possible values: "daily", "weekly", or "monthly". + Possible values: "hourly", "daily", "weekly", or "monthly". agg_func : `Dict` [`str`, `str`], default None A dictionary of the columns to be aggregated and the corresponding aggregating functions. @@ -127,7 +127,12 @@ def get_aggregated_data(df, agg_freq=None, agg_func=None): elif agg_freq and agg_func: df_raw = df[list(agg_func.keys())] df_raw.insert(0, TIME_COL, pd.to_datetime(df[TIME_COL])) - if agg_freq == "daily": + if agg_freq == "hourly": + # Aggregate to hourly + df_tmp = df_raw.resample("H", on=TIME_COL).agg(agg_func) + df_hourly = df_tmp.drop(columns=TIME_COL).reset_index() if TIME_COL in df_tmp.columns else df_tmp.reset_index() + return df_hourly + elif agg_freq == "daily": # Aggregate to daily df_tmp = df_raw.resample("D", on=TIME_COL).agg(agg_func) df_daily = df_tmp.drop(columns=TIME_COL).reset_index() if TIME_COL in df_tmp.columns else df_tmp.reset_index() @@ -143,7 +148,7 @@ def get_aggregated_data(df, agg_freq=None, agg_func=None): df_monthly = df_tmp.drop(columns=TIME_COL).reset_index() if TIME_COL in df_tmp.columns else df_tmp.reset_index() return df_monthly else: - warnings.warn("Invalid \"agg_freq\", must be one of \"daily\", \"weekly\" or \"monthly\". " + warnings.warn("Invalid \"agg_freq\", must be one of \"hourly\", \"daily\", \"weekly\" or \"monthly\". 
" "Non-aggregated data is returned.") return df_raw else: diff --git a/greykite/common/evaluation.py b/greykite/common/evaluation.py index 7b0c945..ac21e79 100644 --- a/greykite/common/evaluation.py +++ b/greykite/common/evaluation.py @@ -603,6 +603,50 @@ def prediction_band_width(observed, lower, upper): return 100.0 * np.mean(np.abs(upper - lower) / observed) if len(observed) > 0 and observed.min() > 0 else None +def mean_interval_score(observed, lower, upper, coverage): + """Calculates the mean interval score. + If an observed value falls within the interval, the score is simply the width of the interval. + If an observed value falls outside the interval, the score is the width of the interval plus an error term + proportional to distance between the actual and its closest interval boundary. + The proportionality constant is 2.0 / (1.0 - `coverage`). + See `Strictly Proper Scoring Rules, Prediction, and Estimation, Tilmann Gneiting and Adrian E. Raftery, 2007, + Journal of the American Statistical Association, Volume 102, 2007 - Issue 477`. + + Parameters + ---------- + observed: `pandas.Series` or `numpy.array` + Numeric, observed values. + lower: `pandas.Series` or `numpy.array` + Numeric, lower bound. + upper: `pandas.Series` or `numpy.array` + Numeric, upper bound. + coverage: `float` + Intended coverage of the prediction bands (0.0 to 1.0) + + Returns + ------- + mean_interval_score: `float` + The mean interval score. 
+ """ + observed, lower, upper = valid_elements_for_evaluation( + reference_arrays=[observed], + arrays=[lower, upper], + reference_array_names="y_true", + drop_leading_only=False, + keep_inf=False) + lower, upper, observed = valid_elements_for_evaluation( + reference_arrays=[lower, upper], + arrays=[observed], + reference_array_names="lower/upper bounds", + drop_leading_only=False, + keep_inf=True) + interval_width = upper - lower + error_lower = np.where(observed < lower, 2.0 * (lower - observed) / (1.0 - coverage), 0.0) + error_upper = np.where(observed > upper, 2.0 * (observed - upper) / (1.0 - coverage), 0.0) + interval_score = interval_width + error_lower + error_upper + return np.mean(interval_score) + + def calc_pred_coverage(observed, predicted, lower, upper, coverage): """Calculates the prediction coverages: prediction band width, prediction band coverage etc. @@ -629,21 +673,27 @@ def calc_pred_coverage(observed, predicted, lower, upper, coverage): metrics = {} if len(observed) > 0: - # relative size of prediction bands vs actual, as a percent + # Relative size of prediction bands vs actual, as a percent. enum = ValidationMetricEnum.BAND_WIDTH metric_func = enum.get_metric_func() metrics.update({PREDICTION_BAND_WIDTH: metric_func(observed, lower, upper)}) enum = ValidationMetricEnum.BAND_COVERAGE metric_func = enum.get_metric_func() - # fraction of observations within the bands + # Fraction of observations within the bands. metrics.update({PREDICTION_BAND_COVERAGE: metric_func(observed, lower, upper)}) - # fraction of observations within the lower band + # Fraction of observations within the lower band. metrics.update({LOWER_BAND_COVERAGE: metric_func(observed, lower, predicted)}) - # fraction of observations within the upper band + # Fraction of observations within the upper band. 
metrics.update({UPPER_BAND_COVERAGE: metric_func(observed, predicted, upper)}) - # difference between actual and intended coverage + # Difference between actual and intended coverage. metrics.update({COVERAGE_VS_INTENDED_DIFF: (metrics[PREDICTION_BAND_COVERAGE] - coverage)}) + + # Mean interval score. + enum = ValidationMetricEnum.MEAN_INTERVAL_SCORE + metric_func = enum.get_metric_func() + metric_name = enum.get_metric_name() + metrics.update({metric_name: (metric_func(observed, lower, upper, coverage))}) return metrics @@ -936,27 +986,35 @@ class EvaluationMetricEnum(Enum): """Fraction of forecasted values that deviate more than 5% from the actual""" def get_metric_func(self): - """Returns the metric function""" + """Returns the metric function.""" return self.value[0] def get_metric_greater_is_better(self): - """Returns the greater_is_better boolean""" + """Returns the greater_is_better boolean.""" return self.value[1] def get_metric_name(self): - """Returns the short name""" + """Returns the short name.""" return self.value[2] class ValidationMetricEnum(Enum): """Valid diagnostic metrics. 
- The values tuple is ``(score_func: callable, greater_is_better: boolean)`` + The values tuple is ``(score_func: callable, greater_is_better: boolean, short_name: str)`` """ - BAND_WIDTH = (prediction_band_width, False) - BAND_COVERAGE = (fraction_within_bands, True) + BAND_WIDTH = (prediction_band_width, False, "band_width") + BAND_COVERAGE = (fraction_within_bands, True, "band_coverage") + MEAN_INTERVAL_SCORE = (mean_interval_score, False, "MIS") + """Mean interval score""" def get_metric_func(self): + """Returns the metric function.""" return self.value[0] def get_metric_greater_is_better(self): + """Returns the greater_is_better boolean.""" return self.value[1] + + def get_metric_name(self): + """Returns the short name.""" + return self.value[2] diff --git a/greykite/common/features/adjust_anomalous_data.py b/greykite/common/features/adjust_anomalous_data.py index 03119cf..2a0d1d1 100644 --- a/greykite/common/features/adjust_anomalous_data.py +++ b/greykite/common/features/adjust_anomalous_data.py @@ -97,8 +97,8 @@ def adjust_anomalous_data( "adjustment_delta": [np.nan, 3, -5, np.nan], # extra columns for filtering "metric": ["y", "y", "z", "z"], - "platform": ["MOBILE", "MOBILE", "DESKTOP", "DESKTOP"], - "vertical": ["ads", "sales", "ads", "ads"], + "dimension1": ["level_1", "level_1", "level_2", "level_2"], + "dimension2": ["level_1", "level_2", "level_1", "level_1"], }) In the above example, @@ -108,7 +108,7 @@ def adjust_anomalous_data( - "adjustment_delta" is the column which includes the delta if it is known. The name of this column is provided using the argument ``adjustment_delta_col``. Use `numpy.nan` if the adjustment size is not known, and the adjusted value will be set to `numpy.nan`. - - "metric", "platform", and "vertical" are example columns for filtering. They + - "metric", "dimension1", and "dimension2" are example columns for filtering. They contain the metric name and dimensions for which the anomaly is applicable. 
``filter_by_dict`` is used to filter on these columns to get the relevant anomalies for the timeseries represented by ``df[value_col]``. @@ -183,6 +183,10 @@ def adjust_anomalous_data( anomaly_df = anomaly_df.copy() new_value_col = f"adjusted_{value_col}" + # Gets min and max timestamps from the input time series + min_ts = df[time_col].min() + max_ts = df[time_col].max() + if new_value_col in df.columns: raise ValueError( f"`df` cannot include this column name: {new_value_col}." @@ -229,10 +233,16 @@ def adjust_anomalous_data( time_values = augmented_df[time_col].astype(str) anomaly_df[start_time_col] = anomaly_df[start_time_col].astype(str) anomaly_df[end_time_col] = anomaly_df[end_time_col].astype(str) + min_ts = str(min_ts) + max_ts = str(max_ts) for i in range(anomaly_df.shape[0]): row = anomaly_df.iloc[i] t1 = row[start_time_col] t2 = row[end_time_col] + if t1 > max_ts or t2 < min_ts: + continue + t1 = max(min_ts, t1) + t2 = min(max_ts, t2) if t2 < t1: raise ValueError( f"End Time: {t2} cannot be before Start Time: {t1}, in ``anomaly_df``.") @@ -255,3 +265,163 @@ def adjust_anomalous_data( return { "adjusted_df": df, "augmented_df": augmented_df} + + +def label_anomalies_multi_metric( + df, + time_col, + value_cols, + anomaly_df, + anomaly_df_grouping_col=None, + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL): + """This function operates on a given data frame (``df``) which includes time (given in ``time_col``) and + metrics. 
+ For each metric (given in ``value_cols``), it augments the data with: + + - a new column which determines if a value is an anomaly (``f"{metric}_is_anomaly"``) + - a new column which is not NA (`np.nan`) when the value is an anomaly (``f"{metric}_anomaly_value"``) + - a new column which is not NA (`np.nan`) when the value is non-anomalous / "normal" (``f"{metric}_normal_value"``) + + The information regarding the anomalies is stored in the input argument ``anomaly_df`` and ``anomaly_df_grouping_col`` determines + which anomaly rows in ``anomaly_df`` correspond to each metric. + + Parameters + ---------- + df : `pandas.DataFrame` + A data frame which at least includes a timestamp column (``time_col``) and + ``value_cols`` which represent the metrics. + time_col : `str` + The column name in ``df`` representing time for the time series data. + The time column can be anything that can be parsed by `pandas.DatetimeIndex`. + value_cols : `list` [`str`] + The columns which include the metrics. + anomaly_df : `pandas.DataFrame` + Data frame with ``start_time_col`` and ``end_time_col`` and ``anomaly_df_grouping_col`` + (if provided). This contains the anomaly periods for each metric + (one of the ``value_cols``). Each row of this dataframe corresponds + to an anomaly occurring between the times given in ``row[start_time_col]`` + and ``row[end_time_col]``. + The ``anomaly_df_grouping_col`` (if not None) determines which metric that + anomaly corresponds to (otherwise we assume all anomalies apply to all metrics). + anomaly_df_grouping_col : `str` or None, default None + The column name for grouping the list of the anomalies which is to appear + in ``anomaly_df``. + This column should include some of the metric names + specified in ``value_cols``. The ``anomaly_df_grouping_col`` (if not None) determines which metric that + anomaly corresponds to (otherwise we assume all anomalies apply to all metrics). 
+ start_time_col : `str`, default ``START_TIME_COL`` + The column name in ``anomaly_df`` representing the start timestamp of + the anomalous period, inclusive. + The format can be anything that can be parsed by pandas DatetimeIndex. + end_time_col : `str`, default ``END_TIME_COL`` + The column name in ``anomaly_df`` representing the end timestamp of + the anomalous period, inclusive. + The format can be anything that can be parsed by pandas DatetimeIndex. + + + Returns + ------- + result : `dict` + A dictionary with the following items: + + - "augmented_df": `pandas.DataFrame` + This is a dataframe obtained by augmenting the input ``df`` with new + columns determining if the metrics appearing in ``df`` are anomalous + or not and the new columns denoting anomaly values and normal values + (described below). + - "is_anomaly_cols": `list` [`str`] + The list of added boolean columns that determine if a value is an anomaly for + a given metric. The format of the columns is ``f"{metric}_is_anomaly"``. + - "anomaly_value_cols": `list` [`str`] + The list of columns containing only anomaly values (`np.nan` otherwise) for each corresponding + metric. The format of the columns is ``f"{metric}_anomaly_value"``. + - "normal_value_cols": `list` [`str`] + The list of columns containing only non-anomalous / normal values (`np.nan` otherwise) + for each corresponding metric. The format of the columns is ``f"{metric}_normal_value"``. 
+ + """ + df = df.copy() + anomaly_df = anomaly_df.copy() + + if time_col not in df.columns: + raise ValueError( + f"time_col: {time_col} wasn't found in data frame which has columns {df.columns}") + + for value_col in value_cols: + if value_col not in df.columns: + raise ValueError( + f"value_col: {value_col} wasn't found in data frame which has columns {df.columns}") + + if (start_time_col not in anomaly_df.columns): + raise ValueError( + f"start_time_col: {start_time_col} wasn't found in data frame which has columns {anomaly_df.columns}") + if (end_time_col not in anomaly_df.columns): + raise ValueError( + f"end_time_col: {end_time_col} wasn't found in data frame which has columns {anomaly_df.columns}") + + anomaly_df[start_time_col] = pd.to_datetime(anomaly_df[start_time_col]) + anomaly_df[end_time_col] = pd.to_datetime(anomaly_df[end_time_col]) + + df[time_col] = pd.to_datetime(df[time_col]) + + is_anomaly_cols = [] + anomaly_value_cols = [] + normal_value_cols = [] + + def add_anomaly_cols_one_metric(df, value_col): + """This function adds the new anomaly columns for each metric. + This will be applied to ``df`` once for each metric below. + + Parameters + ---------- + df : `pandas.DataFrame` + A data frame which at least includes a timestamp column (``time_col``) and + ``value_col`` which represents the metric. + value_col : `str` + The column which includes the metric of interest. 
+ + Returns + ------- + df : `pandas.DataFrame` + A dataframe which has these new columns added to input ``df``: + + - a new column which determines if a value is an anomaly (``f"{value_col}_is_anomaly"``) + - a new column which is not NA (`np.nan`) when the value is an anomaly (``f"{value_col}_anomaly_value"``) + - a new column which is not NA (`np.nan`) when the value is non-anomalous / "normal" (``f"{value_col}_normal_value"``) + + + """ + adj_df_info = adjust_anomalous_data( + df=df, + time_col=time_col, + value_col=value_col, + anomaly_df=anomaly_df, + start_time_col=start_time_col, + end_time_col=end_time_col, + filter_by_value_col=anomaly_df_grouping_col) + + df0 = adj_df_info["augmented_df"] + df0[f"{value_col}_is_anomaly"] = df0[ANOMALY_COL] + df0[f"{value_col}_anomaly_value"] = np.nan + anomaly_ind = (df0[ANOMALY_COL] == 1) + normal_ind = (df0[ANOMALY_COL] == 0) + df0.loc[anomaly_ind, f"{value_col}_anomaly_value"] = df0.loc[anomaly_ind, value_col] + df0.loc[normal_ind, f"{value_col}_normal_value"] = df0.loc[normal_ind, value_col] + del df0[ANOMALY_COL] + del df0[f"adjusted_{value_col}"] + + is_anomaly_cols.append(f"{value_col}_is_anomaly") + anomaly_value_cols.append(f"{value_col}_anomaly_value") + normal_value_cols.append(f"{value_col}_normal_value") + + return df0 + + for value_col in value_cols: + df = add_anomaly_cols_one_metric(df, value_col) + + return { + "augmented_df": df, + "is_anomaly_cols": is_anomaly_cols, + "anomaly_value_cols": anomaly_value_cols, + "normal_value_cols": normal_value_cols} diff --git a/greykite/common/features/timeseries_features.py b/greykite/common/features/timeseries_features.py index af7a574..054b513 100644 --- a/greykite/common/features/timeseries_features.py +++ b/greykite/common/features/timeseries_features.py @@ -25,13 +25,17 @@ import math from datetime import datetime +from datetime import timedelta import numpy as np import pandas as pd +import pytz from holidays_ext import get_holidays as get_hdays +from 
pandas.tseries.frequencies import to_offset from scipy.special import expit from greykite.common import constants as cst +from greykite.common.python_utils import split_offset_str def convert_date_to_continuous_time(dt): @@ -76,7 +80,293 @@ def get_default_origin_for_time_vars(df, time_col): return convert_date_to_continuous_time(date) -def build_time_features_df(dt, conti_year_origin): +def pytz_is_dst_fcn(time_zone): + """For a given timezone, it constructs a function which determines, + for each timestamp in a list (`dt`), whether it is inside the daylight + saving period. + + This function should work for regions in US / Canada and Europe. + + The returned function assumes that the timestamps are in the given + ``time_zone``. + Note that since daylight saving is the same for all of mainland US / Canada, + one can pass any US time zone e.g. ``"US/Pacific"`` to construct a function + which works for all of mainland US. + Similarly for most of Europe, it is sufficient to pass any Europe time zone e.g. + ``"Europe/London"``. + + Note: Since this function is slow, a faster version is available: + `~greykite.common.features.timeseries_features.is_dst_fcn`. + However, we expect the current function would be more accurate assuming + the package `pytz` keeps up to date with potential changes in DST. + + + Parameters + ---------- + time_zone : `str` + A string denoting the time zone e.g. "US/Pacific", "Canada/Eastern", + "Europe/London". + + Returns + ------- + is_dst : callable + A function which takes a list of datetime-like objects + and returns a list of booleans to determine if each timestamp is in daylight saving. + """ + timezone = pytz.timezone(time_zone) + + def is_dst(dt): + """A function which takes a list of datetime-like objects + and returns a list of booleans to determine if each timestamp + is in daylight saving. 
+ + Parameters + ---------- + dt : `list` of datetime-like objects + + Returns + ------- + result : `list` [`bool`] + A list of booleans: + + - If True, the input time is in daylight saving. + - If False, the input time is NOT in daylight saving. + + """ + diff = [] + for dt0 in dt: + timezone_date = timezone.localize(dt0, is_dst=False) + diff.append(timezone_date.tzinfo._dst.seconds) + return list(pd.Series(diff) != 0) + + return is_dst + + +def get_us_dst_start(year): + """For each year, it returns the second Sunday in March, + which is the start of the daylight saving (DST) in US/Canada. + + We assume DST starts on the second Sunday of March at 2 a.m. + + Parameters + ---------- + year : `int` + Year for which DST start date is desired. + + Returns + ------- + result : `datetime.datetime` + The timestamp of start of DST in US/Canada. + """ + # Finds a date in third week of March. + date_in_3rd_week = datetime(year, 3, 15, 2) + # Finds out which week day it is: + weekday = date_in_3rd_week.weekday() + # Finds the Sunday before that by going back to Monday and does an extra -1. + second_sunday_date = date_in_3rd_week.replace(day=(15 - weekday - 1)) + return second_sunday_date + + +def get_us_dst_end(year): + """For each year, it returns the first Sunday in November, + which is the end of the daylight saving (DST) in US/Canada. + + We assume DST ends on the first Sunday of November at 2 a.m. + + Parameters + ---------- + year : `int` + Year for which DST end date is desired. + + Returns + ------- + result : `datetime.datetime` + The timestamp of end of DST in US/Canada. + """ + # Finds the first date in the second week. + date_in_2nd_week = datetime(year, 11, 8, 2) + # Finds out which week day it is: + weekday = date_in_2nd_week.weekday() + # Goes back to Monday of the second week and does an extra -1. 
+ first_sunday_date = date_in_2nd_week.replace(day=(8 - weekday - 1)) + return first_sunday_date + + +def get_eu_dst_start(year): + """For each year, it returns the last Sunday in March, + which is the start of the daylight saving (DST) in Europe. + + We assume Europe DST starts on last Sunday of March at 1 a.m. + + Parameters + ---------- + year : `int` + Year for which DST start date is desired. + + Returns + ------- + result : `datetime.datetime` + The timestamp of start of DST in Europe. + """ + # March is 31 days. + # Finds the last date in the month. + date_in_last_week = datetime(year, 3, 31, 1) + # Finds out which week day it is: + weekday = date_in_last_week.weekday() + # If above date is already a Sunday, returns it. + if weekday == 6: + return date_in_last_week + # Otherwise, goes back to the Monday of that week and does an extra -1. + last_sunday_date = date_in_last_week.replace(day=(31 - weekday - 1)) + return last_sunday_date + + +def get_eu_dst_end(year): + """For each year, it returns the last Sunday in October, + which is the end of the daylight saving (DST) in Europe. + + We assume Europe DST ends on last Sunday of October at 2 a.m. + + Parameters + ---------- + year : `int` + Year for which DST end date is desired. + + Returns + ------- + result : `datetime.datetime` + The timestamp of end of DST in Europe. + """ + # October is 31 days. + # Finds the last date in the month. + date_in_last_week = datetime(year, 10, 31, 2) + # Finds out which week day it is: + weekday = date_in_last_week.weekday() + # If above date is already a Sunday, returns it. + if weekday == 6: + return date_in_last_week + # Otherwise, goes back to the Monday of that week and does an extra -1. 
+ last_sunday_date = date_in_last_week.replace(day=(31 - weekday - 1)) + return last_sunday_date + + +def is_dst_fcn(time_zone): + """For a given timezone, it constructs a function which determines, + for each timestamp in a list (`dt`), whether it is inside the daylight + saving period. + + This function should work for regions in US / Canada and Europe. + + The returned function assumes that the timestamps are in the given + ``time_zone``. + Note that since daylight saving is the same for all of mainland US / Canada, + one can pass any US time zone e.g. ``"US/Pacific"`` to construct a function + which works for all of mainland US. + Similarly for most of Europe, it is sufficient to pass any Europe time zone e.g. + ``"Europe/London"``. + + Some references on when DST started in the modern era: + + - Europe: https://www.timeanddate.com/time/europe/daylight-saving-history.html + - US: https://en.wikipedia.org/wiki/Daylight_saving_time_in_the_United_States + + Note: This function assumes the DST rules remain the same as what they + are in the year 2022 (when this code was written). + A potentially more accurate (but much slower) version is available: + `~greykite.common.features.timeseries_features.pytz_is_dst_fcn`. + However, we expect the current function would be much faster and it can be + updated in case DST rules change. + + + Parameters + ---------- + time_zone : `str` + A string denoting the time zone e.g. "US/Pacific", "Canada/Eastern", + "Europe/London". + + Returns + ------- + is_dst : callable + A function which takes a list of datetime-like objects + and returns a list of booleans to determine if each timestamp is in daylight saving. 
+ """ + if "US" in time_zone or "Canada" in time_zone: + get_dst_start = get_us_dst_start + get_dst_end = get_us_dst_end + elif "Europe" in time_zone: + get_dst_start = get_eu_dst_start + get_dst_end = get_eu_dst_end + else: + raise ValueError( + f"`time_zone` string {time_zone} does not include " + "either of: 'US'/'Canada'/'Europe'") + + # For US, the current convention seems to have started in 2007 + # See references in function docstring + us_year_range = range(2007, 2080) + us_starts = {year: get_dst_start(year) for year in us_year_range} + us_ends = {year: get_dst_end(year) for year in us_year_range} + + # For Europe, the current convention seems to have started in 1996 + # See references in function docstring: + # Quoting from the link: "In 1996, the European Union (EU) standardized the DST schedule" + europe_year_range = range(1996, 2080) + europe_starts = {year: get_dst_start(year) for year in europe_year_range} + europe_ends = {year: get_dst_end(year) for year in europe_year_range} + + if "US" in time_zone or "Canada" in time_zone: + year_range = us_year_range + starts = us_starts + ends = us_ends + else: + # Note that due to the above if statements, we know that else maps to "Europe" + # Otherwise a `ValueError` would have been raised. + year_range = europe_year_range + starts = europe_starts + ends = europe_ends + + def is_dst(dt): + """A function which takes a list of datetime-like objects + and returns a list of booleans to determine if each timestamp + is in daylight saving. + + Parameters + ---------- + dt : `list` of datetime-like objects + + Returns + ------- + result : `list` [`bool`] + A list of booleans: + + - If True, the input time is in daylight saving. + - If False, the input time is NOT in daylight saving. 
+ + """ + is_dst_bool = [] + for dt0 in dt: + year = dt0.year + if year in year_range: + start, end = starts[year], ends[year] + if dt0 >= start and dt0 <= end: + # This will be true at most for one year in the range + is_dst_bool.append(True) + else: + is_dst_bool.append(False) + else: + # This is the rare case for which the timestamp is not within + # the years covered by `year_range` (2007-2079 for US / Canada, 1996-2079 for Europe) + is_dst_bool.append(False) + + return is_dst_bool + + return is_dst + + +def build_time_features_df( + dt, + conti_year_origin, + add_dst_info=True): """This function gets a datetime-like vector and creates new columns containing temporal features useful for time series analysis and forecasting e.g. year, week of year, etc. @@ -84,9 +374,10 @@ def build_time_features_df(dt, conti_year_origin): ---------- dt : array-like (1-dimensional) A vector of datetime-like values - conti_year_origin : float - The origin used for creating continuous time. - + conti_year_origin : `float` + The origin used for creating continuous time, in units of years. + add_dst_info : `bool`, default True + Determines if daylight saving columns for US and Europe should be added. Returns ------- time_features_df : `pandas.DataFrame` @@ -131,6 +422,9 @@ def build_time_features_df(dt, conti_year_origin): * "ct3": float, signed cubic growth, -infinity to infinity * "ct_sqrt": float, signed square root growth, -infinity to infinity * "ct_root3": float, signed cubic root growth, -infinity to infinity + * "us_dst": bool, determines if the time is inside the daylight saving time of US. + This column is only generated if ``add_dst_info=True`` + * "eu_dst": bool, determines if the time is inside the daylight saving time of Europe. 
This column is only generated if ``add_dst_info=True`` """ dt = pd.DatetimeIndex(dt) @@ -210,7 +504,8 @@ def build_time_features_df(dt, conti_year_origin): conti_year = year + (doy - 1 + (tod / 24.0)) / year_length is_weekend = pd.Series(dow).apply(lambda x: x in [6, 7]).values # weekend indicator # categorical var with levels (Mon-Thu, Fri, Sat, Sun), could help when training data are sparse. - dow_grouped = pd.Series(str_dow).apply(lambda x: "1234-MTuWTh" if (x in ["1-Mon", "2-Tue", "3-Wed", "4-Thu"]) else x).values + dow_grouped = pd.Series(str_dow).apply( + lambda x: "1234-MTuWTh" if (x in ["1-Mon", "2-Tue", "3-Wed", "4-Thu"]) else x).values # growth terms ct1 = conti_year - conti_year_origin @@ -266,20 +561,47 @@ def build_time_features_df(dt, conti_year_origin): cst.TimeFeaturesEnum.ct_root3.value: ct_root3, } df = pd.DataFrame(features_dict) + + if add_dst_info: + df[cst.TimeFeaturesEnum.us_dst.value] = is_dst_fcn("US/Pacific")( + df[cst.TimeFeaturesEnum.datetime.value]) + + df[cst.TimeFeaturesEnum.eu_dst.value] = is_dst_fcn("Europe/London")( + df[cst.TimeFeaturesEnum.datetime.value]) + return df -def add_time_features_df(df, time_col, conti_year_origin): - """Adds a time feature data frame to a data frame - :param df: the input data frame - :param time_col: the name of the time column of interest - :param conti_year_origin: the origin of time for the continuous time variable - :return: the same data frame (df) augmented with new columns +def add_time_features_df( + df, + time_col, + conti_year_origin, + add_dst_info=True): + """Adds a time feature data frame to a data frame by calling + `~greykite.common.features.timeseries_features.build_time_features_df`. + + Parameters + ---------- + df : `pandas.Dataframe` + The input data frame + time_col: `str` + The name of the time column of interest + conti_year_origin: + The origin of time for the continuous time variable which is in years unit. 
+ add_dst_info : `bool`, default True + Determines if daylight saving columns for US and Europe should be added. + + Returns + ------- + result : `pandas.Dataframe` + The same data frame (df) augmented with new columns generated by + `~greykite.common.features.timeseries_features.build_time_features_df` """ df = df.reset_index(drop=True) time_df = build_time_features_df( dt=df[time_col], - conti_year_origin=conti_year_origin) + conti_year_origin=conti_year_origin, + add_dst_info=add_dst_info) time_df = time_df.reset_index(drop=True) return pd.concat([df, time_df], axis=1) @@ -324,6 +646,7 @@ def get_holidays(countries, year_start, year_end): # "Easter Monday [England, Wales, Northern Ireland]". country_df[cst.EVENT_DF_LABEL_COL] = country_df[cst.EVENT_DF_LABEL_COL].str.replace("/", ", ") country_df[cst.EVENT_DF_DATE_COL] = pd.to_datetime(country_df[cst.EVENT_DF_DATE_COL]) + country_holiday_dict[country] = country_df return country_holiday_dict @@ -393,14 +716,27 @@ def add_daily_events( df, event_df_dict, date_col=cst.EVENT_DF_DATE_COL, - regular_day_label=cst.EVENT_DEFAULT): + regular_day_label=cst.EVENT_DEFAULT, + neighbor_impact=None, + shifted_effect=None): """For each key of event_df_dict, it adds a new column to a data frame (df) - with a date column (date_col). - Each new column will represent the events given for that key. + with a date column (date_col). + Each new column will represent the events given for that key. + This function also generates 3 binary event flags + ``IS_EVENT_EXACT_COL``, ``IS_EVENT_ADJACENT_COL`` and ``IS_EVENT_COL`` + given the information in ``event_df_dict`` with the following logic: + + (1) If the key contains "_minus_" or "_plus_", that means the event + was generated by the ``add_event_window`` function, and it is a + neighboring day of some exact event day. + In this case, ``IS_EVENT_ADJACENT_COL`` will be 1 for all days in this key. + + (2) Otherwise the key indicates that it is on the exact event day being modeled. 
+ In this case, ``IS_EVENT_EXACT_COL`` will be 1 for all days in this key. - Notes - ----- - As a side effect, the columns in ``event_df_dict`` are renamed. + (3) If a date appears in both types of keys, both above columns will be 1. + + (4) ``IS_EVENT_COL`` is 1 for all dates in the provided ``event_df_dict``. Parameters ---------- @@ -421,6 +757,36 @@ def add_daily_events( the events in ``event_df_dict``. regular_day_label : `str` The label used for regular days which are not "events". + neighbor_impact : `int`, `list` [`int`], callable or None, default None + The impact of neighboring timestamps of the events in ``event_df_dict``. + This is for daily events so the units below are all in days. + + For example, if the data is weekly ("W-SUN") and an event is daily, + it may not exactly fall on the weekly date. + But you can specify for New Year's day on 1-1, it affects all dates + in the week, e.g. 12-31, 1-1, ..., 1-6, then it will be mapped to the weekly date. + In this case you may want to map a daily event's date to a few dates, + and can specify + ``neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)]``. + + Another example is that the data is rolling 7 day daily data, + thus a holiday may affect the t, t+1, ..., t+6 dates. + You can specify ``neighbor_impact=7``. + + If input is `int`, the mapping is t, t+1, ..., t+neighbor_impact-1. + If input is `list`, the mapping is [t+x for x in neighbor_impact]. + If input is a function, it maps each daily event's date to a list of dates. + shifted_effect : `list` [`str`] or None, default None + Additional neighbor events based on given events. + For example, passing ["-1D", "7D"] will add extra daily events which are 1 day before + and 7 days after the given events. + Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas ``to_offset``. 
+ The new events' names will be the current events' names with suffix "{offset}_before" or "{offset}_after". + For example, if we have an event named "US_Christmas Day", + a "7D" shift will have name "US_Christmas Day_7D_after". + This is useful when you expect an offset of the current holidays also has impact on the + time series, or you want to interact the lagged terms with autoregression. + If ``neighbor_impact`` is also specified, this will be applied after adding neighboring days. Returns ------- @@ -429,13 +795,72 @@ def add_daily_events( one for each key of ``event_df_dict``. """ df[date_col] = pd.to_datetime(df[date_col]) + get_neighbor_days_func = None + new_event_cols = [cst.IS_EVENT_EXACT_COL, cst.IS_EVENT_ADJACENT_COL, cst.IS_EVENT_COL] + event_flag_df_list = [] + if neighbor_impact is not None: + if isinstance(neighbor_impact, int): + neighbor_impact = sorted([neighbor_impact, 0]) + neighbor_impact[1] += 1 + + def get_neighbor_days_func(date): + return [date + timedelta(days=d) for d in range(*neighbor_impact)] + elif isinstance(neighbor_impact, list): + def get_neighbor_days_func(date): + return [date + timedelta(days=d) for d in neighbor_impact] + else: + get_neighbor_days_func = neighbor_impact for label, event_df in event_df_dict.items(): - event_df = event_df.copy() + event_df = event_df.copy().drop_duplicates() # Makes a copy to avoid modifying input new_col = f"{cst.EVENT_PREFIX}_{label}" event_df.columns = [date_col, new_col] event_df[date_col] = pd.to_datetime(event_df[date_col]) + # Handles neighboring impact. 
+ if get_neighbor_days_func is not None: + new_event_df = None + for i in range(len(event_df)): + mapped_dates = get_neighbor_days_func(event_df["date"].iloc[i]) + new_event_df = pd.concat([ + new_event_df, event_df.iloc[[i] * len(mapped_dates)].assign(**{date_col: mapped_dates})], + axis=0 + ) + event_df = new_event_df.drop_duplicates().reset_index(drop=True) df = df.merge(event_df, on=date_col, how="left") df[new_col] = df[new_col].fillna(regular_day_label) + # Adds neighbor events if requested. + new_event_dfs = [] + if shifted_effect is not None: + for lag in shifted_effect: + num, freq = split_offset_str(lag) + num = int(num) + if num != 0: + lag_offset = to_offset(lag) + new_event_df = event_df.copy() + new_event_df[date_col] += lag_offset + suffix = cst.EVENT_SHIFTED_SUFFIX_BEFORE if num < 0 else cst.EVENT_SHIFTED_SUFFIX_AFTER + new_col = f"{cst.EVENT_PREFIX}_{label}_{abs(num)}{freq}{suffix}" + new_event_df.columns = [date_col, new_col] + new_event_df[new_col] += f"_{abs(num)}{freq}{suffix}" + df = df.merge(new_event_df, on=date_col, how="left") + df[new_col] = df[new_col].fillna(regular_day_label) + new_event_dfs.append(new_event_df) + # Generates event indicators. + # `augmented_event_df` contains `date_col` and three event indicator columns to be added to `df`. + for event_df_temp in [event_df] + new_event_dfs: + augmented_event_df = event_df_temp[[date_col]].drop_duplicates() + is_event_adjacent = "_minus_" in label or "_plus_" in label + augmented_event_df[cst.IS_EVENT_EXACT_COL] = 0 if is_event_adjacent else 1 + augmented_event_df[cst.IS_EVENT_ADJACENT_COL] = 1 if is_event_adjacent else 0 + augmented_event_df[cst.IS_EVENT_COL] = 1 # In either case, `IS_EVENT_COL` is 1. + event_flag_df_list.append(augmented_event_df) + + event_flag_df = pd.concat(event_flag_df_list) + # Sets a day as 1 if it is marked by any of the keys in `event_df_dict`. 
+ event_flag_df = event_flag_df.groupby(by=date_col)[new_event_cols].sum().reset_index(drop=False) + event_flag_df[new_event_cols] = 1 * (event_flag_df[new_event_cols] > 0) + # Joins the new event indicators to `df`. + df = df.merge(event_flag_df, on=date_col, how="left") + df[new_event_cols] = df[new_event_cols].fillna(0) return df @@ -673,13 +1098,20 @@ def growth_func(x): else: time_postfixes = [""] * len(changepoint_values) - changepoint_df = pd.DataFrame() + changepoint_df_list = [] for i, changepoint in enumerate(changepoint_values): time_feature = np.array(df[continuous_time_col]) - changepoint # shifted time column (t - c_i) growth_term = np.array([growth_func(max(x, 0)) for x in time_feature]) # growth as a function of time time_feature_ind = time_feature >= 0 # Indicator(t >= c_i), lets changepoint take effect starting at c_i - new_col = growth_term * time_feature_ind - changepoint_df[f"{cst.CHANGEPOINT_COL_PREFIX}{i}{time_postfixes[i]}"] = new_col + new_col = pd.Series( + data=growth_term * time_feature_ind, + name=f"{cst.CHANGEPOINT_COL_PREFIX}{i}{time_postfixes[i]}" + ) + changepoint_df_list.append(new_col) + if len(changepoint_values) > 0: + changepoint_df = pd.concat(objs=changepoint_df_list, axis=1, ignore_index=False) + else: + changepoint_df = pd.DataFrame() return changepoint_df @@ -1024,7 +1456,7 @@ def fourier_series_fcn(col_name, period=1.0, order=1, seas_name=None): """ def fs_func(df): - out_df = pd.DataFrame() + out_df_list = [] out_cols = [] if col_name not in df.columns: @@ -1048,8 +1480,12 @@ def fs_func(df): out_cols.append(cos_col_name) omega = 2 * math.pi / period u = omega * k * x - out_df[sin_col_name] = np.sin(u) - out_df[cos_col_name] = np.cos(u) + out_df_list.append(pd.Series(data=np.sin(u), name=sin_col_name)) + out_df_list.append(pd.Series(data=np.cos(u), name=cos_col_name)) + if len(out_df_list) > 0: + out_df = pd.concat(objs=out_df_list, axis=1, ignore_index=False) + else: + out_df = pd.DataFrame() return {"df": out_df, 
"cols": out_cols} return fs_func @@ -1134,7 +1570,12 @@ def signed_pow_fcn(y): signed_sq = signed_pow_fcn(2) -def logistic(x, growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0): +def logistic( + x, + growth_rate=1.0, + capacity=1.0, + floor=0.0, + inflection_point=0.0): """Evaluates the logistic function at x with the specified growth rate, capacity, floor, and inflection point. @@ -1154,7 +1595,11 @@ def logistic(x, growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0): return floor + capacity * expit(growth_rate * (x - inflection_point)) -def get_logistic_func(growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0): +def get_logistic_func( + growth_rate=1.0, + capacity=1.0, + floor=0.0, + inflection_point=0.0): """Returns a function that evaluates the logistic function at t with the specified growth rate, capacity, floor, and inflection point. diff --git a/greykite/common/features/timeseries_lags.py b/greykite/common/features/timeseries_lags.py index 92293b2..f652aa4 100644 --- a/greykite/common/features/timeseries_lags.py +++ b/greykite/common/features/timeseries_lags.py @@ -76,13 +76,17 @@ def build_lag_df( orders = range(1, max_order + 1) if df is not None: - lag_df = pd.DataFrame() + lag_df_list = [] for i in orders: col_name = f"{value_col}{cst.LAG_INFIX}{i}" col_names.append(col_name) if df is not None: - lag_df[col_name] = df[value_col].shift(i) + lag_df_list.append(df[value_col].shift(i).rename(col_name)) + + if df is not None: + lag_df = pd.concat(objs=lag_df_list, axis=1, ignore_index=False) + return { "lag_df": lag_df, "col_names": col_names} @@ -267,7 +271,7 @@ def build_agg_lag_df( max_order=max_order, orders=None) lag_df = lag_info["lag_df"] - agg_lag_df = pd.DataFrame() + agg_lag_df_list = [] for orders in orders_list: if len(orders) > len(set(orders)): @@ -280,12 +284,14 @@ def build_agg_lag_df( if df is not None: if agg_func == "mean": # uses vectorized mean for speed - agg_lag_df[col_name] = ( - lag_df.iloc[:, 
orders_col_index].mean(axis=1)) + agg_lag_df_list.append( + lag_df.iloc[:, orders_col_index].mean(axis=1).rename(col_name) + ) else: # generic aggregation - agg_lag_df[col_name] = ( - lag_df.iloc[:, orders_col_index].apply(agg_func, axis=1)) + agg_lag_df_list.append( + lag_df.iloc[:, orders_col_index].apply(agg_func, axis=1).rename(col_name) + ) for interval in interval_list: if len(interval) != 2: @@ -302,8 +308,12 @@ def build_agg_lag_df( col_name = f"{value_col}_{agg_name}_{col_suffix}" col_names.append(col_name) if df is not None: - agg_lag_df[col_name] = ( - lag_df.iloc[:, orders_col_index].apply(agg_func, axis=1)) + agg_lag_df_list.append( + lag_df.iloc[:, orders_col_index].apply(agg_func, axis=1).rename(col_name) + ) + + if df is not None: + agg_lag_df = pd.concat(objs=agg_lag_df_list, axis=1, ignore_index=False) return { "agg_lag_df": agg_lag_df, @@ -461,14 +471,14 @@ def build_lags_func(df, past_df=None): # if past_df length (number of rows) is smaller than max_order # we expand it to avoid NULLs if past_df.shape[0] < max_order: - past_df_addition = pd.DataFrame( - {value_col: [np.nan]*(max_order - past_df.shape[0])}) - past_df = past_df_addition.append(past_df) + past_df_list = [pd.DataFrame( + {value_col: [np.nan]*(max_order - past_df.shape[0])}), past_df] + past_df = pd.concat(past_df_list) # df is expanded by adding past_df as the past data for df # this will help in avoiding NULLs to appear in lag_df and agg_lag_df # as long as past_df has data in it or expanded df is interpolated - df_expanded = past_df.append(df) + df_expanded = pd.concat([past_df, df]) if series_na_fill_func is not None: df_expanded[value_col] = series_na_fill_func(df_expanded[value_col]) diff --git a/greykite/framework/benchmark/gen_moving_timeseries_forecast.py b/greykite/common/gen_moving_timeseries_forecast.py similarity index 100% rename from greykite/framework/benchmark/gen_moving_timeseries_forecast.py rename to greykite/common/gen_moving_timeseries_forecast.py diff --git 
a/greykite/common/logging.py b/greykite/common/logging.py index 122418a..a96ebde 100644 --- a/greykite/common/logging.py +++ b/greykite/common/logging.py @@ -22,6 +22,7 @@ """Logging functions.""" import logging +import sys from enum import Enum import numpy as np @@ -30,6 +31,14 @@ from greykite.common.constants import LOGGER_NAME +# Add time stamp to logging message +logging.basicConfig( + level=logging.INFO, + stream=sys.stdout, + format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", +) + # Here we name the logger "LOGGER_NAME". # We don't want to change the default behaviour of the root logger, # which will influence the behaviour of other modules. @@ -97,7 +106,7 @@ def pprint(params, offset=0, printer=repr): np.set_printoptions(**options) lines = ''.join(params_list) # Strip trailing space to avoid nightmare in doctests - lines = '\n'.join(l.rstrip(' ') for l in lines.split('\n')) + lines = '\n'.join(line.rstrip(' ') for line in lines.split('\n')) return lines diff --git a/greykite/common/python_utils.py b/greykite/common/python_utils.py index 1abb0c8..eddc1b6 100644 --- a/greykite/common/python_utils.py +++ b/greykite/common/python_utils.py @@ -27,6 +27,7 @@ import re import warnings from dataclasses import field +from typing import List import numpy as np import pandas as pd @@ -287,8 +288,8 @@ def assert_equal( } } kwargs : keyword args, optional - Keyword args to pass to `pandas.util.testing.assert_frame_equal`, - `pandas.util.testing.assert_series_equal`. + Keyword args to pass to `pandas.testing.assert_frame_equal`, + `pandas.testing.assert_series_equal`. Raises ------ @@ -831,3 +832,26 @@ def group_strs_with_regex_patterns( strings_list = [x for x in strings_list if x not in group] return {"str_groups": str_groups, "remainder": strings_list} + + +def split_offset_str( + offset_str: str) -> List[str]: + """Splits a pandas offset string into number part and frequency string part. 
+ + Parameters + ---------- + offset_str : `str` + An offset string parsable by `pandas`. + For example, "7D", "-5H", etc. + The number part should only include numbers and "+" or "-". + The frequency part should only include letters. + + Returns + ------- + split_strs : `list` [`str`] + A list of split strings. + For example, ["7", "D"]. + """ + freq = offset_str.lstrip("+-0123456789") + num = offset_str[:len(offset_str) - len(freq)] + return [num, freq] diff --git a/greykite/common/testing_utils.py b/greykite/common/testing_utils.py index accc0c3..dbef687 100644 --- a/greykite/common/testing_utils.py +++ b/greykite/common/testing_utils.py @@ -394,14 +394,16 @@ def generate_anomalous_data(periods=10): "ts": ts, "y": range(periods), "z": range(20, 20+periods)}) + df["y"] = df["y"].astype(float) + df["z"] = df["z"].astype(float) anomaly_df = pd.DataFrame({ - METRIC_COL: ["y", "y", "z", "z"], - "platform": ["MOBILE", "MOBILE", "DESKTOP", "DESKTOP"], - "vertical": ["ads", "sales", "ads", "ads"], - START_TIME_COL: ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018"], - END_TIME_COL: ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018"], - ADJUSTMENT_DELTA_COL: [np.nan, 3., -5., np.nan]}) + METRIC_COL: ["y", "y", "z", "z", "z"], + "dimension1": ["level_1", "level_1", "level_2", "level_2", "level_2"], + "dimension2": ["level_1", "level_2", "level_1", "level_1", "level_1"], + START_TIME_COL: ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018", "1/1/2099"], + END_TIME_COL: ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018", "1/2/2099"], + ADJUSTMENT_DELTA_COL: [np.nan, 3., -5., np.nan, np.nan]}) for col in [START_TIME_COL, END_TIME_COL]: anomaly_df[col] = pd.to_datetime(anomaly_df[col]) diff --git a/greykite/common/time_properties.py b/greykite/common/time_properties.py index e513ce3..d8a68b6 100644 --- a/greykite/common/time_properties.py +++ b/greykite/common/time_properties.py @@ -175,7 +175,7 @@ def fill_missing_dates(df, time_col=TIME_COL, freq=None): If the timestamps in ``df`` are not evenly
spaced, irregular timestamps may be removed. """ - freq = freq if freq is not None else pd.infer_freq(df[time_col]) + freq = freq if freq is not None else infer_freq(df, time_col) df = df.reset_index(drop=True) complete_dates = pd.DataFrame({ time_col: pd.date_range( @@ -207,7 +207,7 @@ def get_canonical_data( freq: str = None, date_format: str = None, tz: str = None, - train_end_date: datetime = None, + train_end_date: Optional[Union[str, datetime.datetime]] = None, regressor_cols: List[str] = None, lagged_regressor_cols: List[str] = None, anomaly_info: Optional[Union[Dict, List[Dict]]] = None): @@ -234,7 +234,7 @@ def get_canonical_data( If None (recommended), inferred by `pandas.to_datetime`. tz : `str` or pytz.timezone object or None, default None Passed to `pandas.tz_localize` to localize the timestamp. - train_end_date : `datetime.datetime` or None, default None + train_end_date : `str` or `datetime.datetime` or None, default None Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the minimum of ``self.last_date_for_val`` and ``self.last_date_for_reg``. @@ -353,7 +353,7 @@ def get_canonical_data( UserWarning) df = df_standardized.sort_values(by=TIME_COL) # Infers data frequency - inferred_freq = pd.infer_freq(df[TIME_COL]) + inferred_freq = infer_freq(df, TIME_COL) if freq is None: freq = inferred_freq elif inferred_freq is not None and freq != inferred_freq: @@ -377,33 +377,21 @@ def get_canonical_data( if tz is not None: df = df.tz_localize(tz) - df_before_adjustment = None - if anomaly_info is not None: - # Saves values before adjustment. - df_before_adjustment = df.copy() - # Adjusts columns in df (e.g. `value_col`, `regressor_cols`) - # using the anomaly info. One dictionary of parameters - # for `adjust_anomalous_data` is provided for each column to adjust. 
- if not isinstance(anomaly_info, (list, tuple)): - anomaly_info = [anomaly_info] - for single_anomaly_info in anomaly_info: - adjusted_df_dict = adjust_anomalous_data( - df=df, - time_col=TIME_COL, - **single_anomaly_info) - # `self.df` with values for single_anomaly_info["value_col"] adjusted. - df = adjusted_df_dict["adjusted_df"] + # Replaces infinity values in `value_col` by `np.nan` + df[value_col].replace([np.inf, -np.inf], np.nan, inplace=True) - # Standardizes `value_col` name - df_before_adjustment.rename({ - value_col: VALUE_COL - }, axis=1, inplace=True) - # Standardizes `value_col` name + # Saves values before adjustment. + df_original_value_col = df.copy() + # Standardizes `value_col` name. df.rename({ value_col: VALUE_COL }, axis=1, inplace=True) - # Finds date of last available value + # Finds date of last available value. + # - `last_date_for_val` is the last timestamp with non-null values in `VALUE_COL`. + # - `last_date_for_reg` is the last timestamp with non-null values in `regressor_cols`. + # - `max_train_end_date` is inferred as the minimum of the above two. + # `max_train_end_date` will be used to determine `train_end_date` when the latter is not provided. last_date_available = df[TIME_COL].max() last_date_for_val = df[df[VALUE_COL].notnull()][TIME_COL].max() last_date_for_reg = None @@ -420,6 +408,59 @@ def get_canonical_data( regressor_cols = [] max_train_end_date = last_date_for_val + # Chooses appropriate `train_end_date`. + # Case 1: if not provided, the last timestamp with a non-null value (`max_train_end_date`) is used. + # Case 2: if it is out of the range of the data, raises an error since it should not be allowed. + # Case 3: otherwise, we respect the user's input `train_end_date`. NAs are kept and can be imputed in the pipeline. 
+ train_end_date = pd.to_datetime(train_end_date) + if train_end_date is None: + train_end_date = max_train_end_date + warnings.warn( + f"`train_end_date` is not provided, or {value_col} column of the provided time series contains " + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " + f"Setting `train_end_date` to the last timestamp with a non-null value ({train_end_date}).", + UserWarning) + elif train_end_date > last_date_available: + # TODO: replace the warning with a `ValueError` and bump the version since it changes the behavior. + train_end_date = max_train_end_date + warnings.warn( + f"{value_col} column of the provided time series contains " + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " + f"Setting `train_end_date` to the last timestamp with a non-null value ({train_end_date}).", + UserWarning) + elif train_end_date > max_train_end_date: + # Does not modify the user-input `train_end_date`, but raises a warning. + warnings.warn( + f"{value_col} column of the provided time series contains trailing NAs. " + f"These NA values will be imputed in the pipeline.", + UserWarning) + + df_before_adjustment = None + if anomaly_info is not None: + # Saves values before adjustment. + df_before_adjustment = df_original_value_col.copy() + # Adjusts columns in df (e.g. `value_col`, `regressor_cols`) + # using the anomaly info. One dictionary of parameters + # for `adjust_anomalous_data` is provided for each column to adjust. + if not isinstance(anomaly_info, (list, tuple)): + anomaly_info = [anomaly_info] + for single_anomaly_info in anomaly_info: + adjusted_df_dict = adjust_anomalous_data( + df=df_original_value_col, + time_col=TIME_COL, + **single_anomaly_info) + # `self.df` with values for single_anomaly_info["value_col"] adjusted. + df_original_value_col = adjusted_df_dict["adjusted_df"] + # Standardizes `value_col` name. 
+ df_before_adjustment.rename({ + value_col: VALUE_COL + }, axis=1, inplace=True) + # Standardizes `value_col` name. + df = df_original_value_col.rename({ + value_col: VALUE_COL + }, axis=1, inplace=False) + + # Processes lagged regressors. last_date_for_lag_reg = None if lagged_regressor_cols: available_regressor_cols = [col for col in df.columns if col not in [TIME_COL, VALUE_COL]] @@ -432,25 +473,6 @@ def get_canonical_data( else: lagged_regressor_cols = [] - # Chooses appropriate train_end_date - if train_end_date is None: - train_end_date = max_train_end_date - if train_end_date < last_date_available: - warnings.warn( - f"{value_col} column of the provided TimeSeries contains " - f"null values at the end. Setting 'train_end_date' to the last timestamp with a " - f"non-null value ({train_end_date}).", - UserWarning) - elif train_end_date > max_train_end_date: - warnings.warn( - f"Input timestamp for the parameter 'train_end_date' " - f"({train_end_date}) either exceeds the last available timestamp or" - f"{value_col} column of the provided TimeSeries contains null " - f"values at the end. Setting 'train_end_date' to the last timestamp with a " - f"non-null value ({max_train_end_date}).", - UserWarning) - train_end_date = max_train_end_date - extra_reg_cols = [col for col in df.columns if col not in regressor_cols and col in lagged_regressor_cols] fit_cols = [TIME_COL, VALUE_COL] + regressor_cols + extra_reg_cols fit_df = df[df[TIME_COL] <= train_end_date][fit_cols] @@ -469,3 +491,48 @@ def get_canonical_data( "last_date_for_reg": last_date_for_reg, "last_date_for_lag_reg": last_date_for_lag_reg, } + + +def infer_freq( + df, + time_col=TIME_COL, + window_size=20): + """Infers frequency of the timestamps provided in the ``time_col`` of ``df``. + + Notes + ----- + If the timeseries does not have any missing values the + ``pandas.infer_freq`` can correctly infer any valid frequency with 20 datapoints. 
+ Valid frequencies are listed here: + https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. + + Parameters + ---------- + df : `pandas.DataFrame` + Dataframe with column ``time_col`` + time_col: `str` or None, default TIME_COL + Time column name + window_size: `int` or None, default 20 + Window size to subset ``df`` at each iteration + + Returns + ------- + freq : `str` + Inferred frequency of the timestamps provided in the ``time_col`` of ``df``. + """ + df = df.copy() + df = df[df[time_col].notna()] + df[time_col] = pd.to_datetime(df[time_col]) + freq = pd.infer_freq(df[time_col]) + if freq is None: + start_index = 0 + end_index = start_index + window_size + while end_index <= df.shape[0]: + df_temp = df.iloc[start_index:end_index] + freq = pd.infer_freq(df_temp[time_col]) + if freq is not None: + break + start_index = end_index + end_index = start_index + window_size + + return freq diff --git a/greykite/common/time_properties_forecast.py b/greykite/common/time_properties_forecast.py index 5053baa..ada5659 100644 --- a/greykite/common/time_properties_forecast.py +++ b/greykite/common/time_properties_forecast.py @@ -100,7 +100,8 @@ def get_forecast_time_properties( Frequency strings can have multiples, e.g. '5H'. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a list of frequency aliases. - If None, inferred by pd.infer_freq. + If None, inferred by + `~greykite.common.time_properties.infer_freq`. Provide this parameter if ``df`` has missing timepoints. date_format : `str` or None, default None strftime format to parse time column, eg ``%m/%d/%Y``. 
diff --git a/greykite/common/viz/colors_utils.py b/greykite/common/viz/colors_utils.py index 1fec454..0665432 100644 --- a/greykite/common/viz/colors_utils.py +++ b/greykite/common/viz/colors_utils.py @@ -18,9 +18,11 @@ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# original author: Sayan Patra +# original author: Sayan Patra, Kaixu Yang, Reza Hosseini """Color palette for plotting.""" +import numpy as np +from matplotlib.cm import get_cmap from plotly.colors import DEFAULT_PLOTLY_COLORS from plotly.colors import n_colors from plotly.colors import validate_colors @@ -58,3 +60,45 @@ def get_color_palette(num, colors=DEFAULT_PLOTLY_COLORS): num, colortype="rgb") return color_palette + + +def get_distinct_colors( + num_colors, + opacity=0.95): + """Gets ``num_colors`` most distinguishable colors. + Uses color maps "tab10", "tab20" or "viridis" depending on the + number of colors needed. + See the above color palettes here: + https://matplotlib.org/stable/tutorials/colors/colormaps.html + + Parameters + ---------- + num_colors : `int` + The number of colors needed. + opacity : `float`, default 0.95 + The opacity of the color. This has to be a number between 0 and 1. + + Returns + ------- + colors : `list` [`str`] + A list of color strings in "rgba" format. + """ + if opacity < 0 or opacity > 1: + raise ValueError("Opacity must be between 0 and 1.") + + if num_colors <= 10: + colors = get_cmap("tab10").colors + elif num_colors <= 20: + colors = get_cmap("tab20").colors + elif num_colors <= 256: + # Removes default opacity by ":3".
+ colors = get_cmap(name="viridis")(np.linspace(0, 1, num_colors))[:, :3] + else: + raise ValueError("The maximum number of colors is 256.") + + result = [] + for color in colors: + # Converts the color components to "rgba" format + color = f"rgba{int(color[0] * 255), int(color[1] * 255), int(color[2] * 255), opacity}" + result.append(color) + return result[:num_colors] diff --git a/greykite/common/viz/timeseries_annotate.py b/greykite/common/viz/timeseries_annotate.py index bb1249f..59d1214 100644 --- a/greykite/common/viz/timeseries_annotate.py +++ b/greykite/common/viz/timeseries_annotate.py @@ -20,9 +20,20 @@ # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # original author: Reza Hosseini, Kaixu Yang """Plotting functions to add annotations to timseries""" - +import pandas as pd import plotly.graph_objects as go +from greykite.common.constants import ACTUAL_COL +from greykite.common.constants import ANOMALY_COL +from greykite.common.constants import END_TIME_COL +from greykite.common.constants import PREDICTED_ANOMALY_COL +from greykite.common.constants import PREDICTED_COL +from greykite.common.constants import START_TIME_COL +from greykite.common.constants import TIME_COL +from greykite.common.features.adjust_anomalous_data import label_anomalies_multi_metric +from greykite.common.viz.colors_utils import get_distinct_colors +from greykite.common.viz.timeseries_plotting import plot_forecast_vs_actual + def plt_annotate_series( df, @@ -344,3 +355,796 @@ def plt_compare_series_annotations( fig = go.Figure(data=data, layout=layout) return fig + + +def plot_lines_markers( + df, + x_col, + line_cols=None, + marker_cols=None, + line_colors=None, + marker_colors=None): + """A lightweight, easy-to-use function to create a plotly figure of given + lines (curves) and markers (points) from the columns of a dataframe with a + legend which matches the column names. 
+ This can be used for example to annotate multiple curves with markers + with an easy function call. + + Parameters + ---------- + df : `pandas.DataFrame` + Data frame with ``x_col`` and value columns specified in ``line_cols`` and ``marker_cols``. + x_col : `str` + The column used for the x-axis. + line_cols : `list` [`str`] or None, default None + The list of y-axis variables to be plotted as lines / curves. + marker_cols : `list` [`str`] or None, default None + The list of y-axis variables to be plotted as markers / points. + line_colors : `list` [`str`] or None, default None + The list of colors to be used for each corresponding line given in ``line_cols``. + marker_colors : `list` [`str`] or None, default None + The list of colors to be used for each corresponding set of markers given in ``marker_cols``. + title : `str` or None, default None + Plot title. If None, default is based on axis labels. + + Returns + ------- + fig : `plotly.graph_objects.Figure` + Interactive plotly graph of one or more columns in ``df`` against ``x_col``. 
+ """ + + if line_colors is not None: + if len(line_colors) != len(line_cols): + raise ValueError( + "If `line_colors` is passed, its length must be equal to `line_cols`") + + if marker_colors is not None: + if len(marker_colors) != len(marker_cols): + raise ValueError( + "If `marker_colors` is passed, its length must be equal to `marker_cols`") + + if line_cols is None and marker_cols is None: + raise ValueError( + "At least one of `line_cols` or `marker_cols` must be passed as a list of strings (not None).") + + fig = go.Figure() + # Below we count the number of figure components to assign proper labels to legends + count_fig_data = -1 + if line_cols is not None: + for i, col in enumerate(line_cols): + if line_colors is not None: + line = go.scatter.Line(color=line_colors[i]) + else: + line = go.scatter.Line() + + fig.add_trace(go.Scatter( + x=df[x_col], + y=df[col], + mode="lines", + line=line, + showlegend=True)) + count_fig_data += 1 + fig["data"][count_fig_data]["name"] = col + + if marker_cols is not None: + for i, col in enumerate(marker_cols): + if marker_colors is not None: + marker = go.scatter.Marker(color=marker_colors[i]) + else: + marker = go.scatter.Marker() + fig.add_trace(go.Scatter( + x=df[x_col], + y=df[col], + mode="markers", + marker=marker, + showlegend=True)) + count_fig_data += 1 + fig["data"][count_fig_data]["name"] = col + + return fig + + +def plot_event_periods_multi( + periods_df, + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL, + freq=None, + grouping_col=None, + min_timestamp=None, + max_timestamp=None, + new_cols_tag="_is_anomaly", + title="anomaly periods"): + """For a dataframe (``periods_df``) with rows denoting start and end of the periods, + it plots the periods. If an extra segmentation is given (``grouping_col``) then + the periods in each segment/slice will be plotted separately on top of each other so that + their overlap can be seen easily. 
+ + Parameters + ---------- + periods_df : `pandas.DataFrame` + Data frame with ``start_time_col`` and ``end_time_col`` and optionally + ``grouping_col`` if passed. + start_time_col : `str`, default START_TIME_COL + The column denoting the start of a period. The type can be any type + accepted by `pandas.to_datetime`. + end_time_col : `str`, default END_TIME_COL + The column denoting the end of a period. The type can be any type + accepted by `pandas.to_datetime`. + freq : `str` or None, default None + Frequency of the generated time grid which is used to plot the horizontal + line segments (points). + + If None, we use hourly as default (freq = "H") which will be accurate + for timestamps which are rounded up to an hour. For finer timestamps + user should specify higher frequencies e.g. "min" for minutely. Also for + daily, weekly, monthly data, user can use a lower frequency to make the plot + size on disk smaller. + grouping_col : `str` or None, default None + A column which specifies the slicing. + Each slice's event / anomaly period will be plotted with a specific color + for that slice. Each segment will appear at a different height of the y-axis + and its periods would be annotated at that height. + min_timestamp : `str` or None, default None + A string denoting the starting point (time) for the x axis. + If None, the minimum of ``start_time_col`` will be used. + max_timestamp : `str` or None, default None + A string denoting the end point (time) for the x axis. + If None, the maximum of ``end_time_col`` will be used. + title : `str` or None, default "anomaly periods" + Plot title. If None, default is based on axis labels. + new_cols_tag : `str`, default "_is_anomaly" + The tag used in the column names for each group. + The column name has this format: f"{group}{new_cols_tag}". 
+ For example if a group is "impressions", with the default value + of this argument the added column name is "impressions_is_anomaly". + + Returns + ------- + result : `dict` + + - "fig" : `plotly.graph_objects.Figure` + Interactive plotly graph of periods given for each group. + - "labels_df" : `pandas.DataFrame` + A dataframe which includes timestamps as one column (TIME_COL) and one + dummy string column for each group. + The values of the new columns are None, except for the time periods specified + in each corresponding group. + - "groups" : `list` [`str`] + The distinct values seen in ``periods_df[grouping_col]`` which are used for + slicing of data. + - "new_cols" : `list` [`str`] + The new columns generated and added to ``labels_df``. + Each column corresponds to one slice of the data as specified in + ``grouping_col``. + - "ts" : `list` [`pandas._libs.tslibs.timestamps.Timestamp`] + A time-grid generated by ``pandas.date_range``. + - "min_timestamp" : `str` or None, default None + A string denoting the starting point (time) for the x axis. + - "max_timestamp" : `str` or None, default None + A string denoting the end point (time) for the x axis. + - "marker_colors" : `list` [`str`] + A list of strings denoting the colors used for various slices. 
+ + """ + periods_df = periods_df.copy() + if min_timestamp is None: + min_timestamp = periods_df[start_time_col].min() + if max_timestamp is None: + max_timestamp = periods_df[end_time_col].max() + + periods_df = periods_df[ + (periods_df[start_time_col] >= min_timestamp) & + (periods_df[end_time_col] <= max_timestamp)].reset_index(drop=True) + + if freq is None: + freq = "H" + + ts = pd.date_range(start=min_timestamp, end=max_timestamp, freq=freq) + labels_df = pd.DataFrame({TIME_COL: ts}) + # Converting the time to standard string format in order to use plotly safely + labels_df[TIME_COL] = labels_df[TIME_COL].dt.strftime("%Y-%m-%d %H:%M:%S") + + def add_periods_dummy_column_for_one_group( + labels_df, + periods_df, + new_col, + label): + """This function will add a dummy column for the time periods of each group to the ``labels_df``. + Parameters + ---------- + labels_df : `pandas.DataFrame` + A data frame which at least includes a timestamp column (TIME_COL). + periods_df : `pandas.DataFrame` + Data frame with ``start_time_col`` and ``end_time_col``. + new_col : `str` + The column name for the new dummy column to be added. + label : `str` + The label to be used in the new column when an event / anomaly is + happening. Other values will be None. + + + Returns + ------- + labels_df : `pandas.DataFrame` + A dataframe which has one extra column as compared with the input ``labels_df``. + This extra column is a dummy string column for the input ``label``. 
+ """ + for i, row in periods_df.iterrows(): + t1 = row[start_time_col] + t2 = row[end_time_col] + if t2 < t1: + raise ValueError( + f"End Time: {t2} cannot be before Start Time: {t1}, in ``periods_df``.") + bool_index = (ts >= t1) & (ts <= t2) + labels_df.loc[bool_index, new_col] = label + + return labels_df + + new_cols = [] + # If there is no grouping column we add a grouping column with only one value ("metric") + if grouping_col is None: + grouping_col = "metric" + periods_df["metric"] = "metric" + + groups = set(periods_df[grouping_col].values) + + marker_colors = get_distinct_colors( + len(groups), + opacity=0.8) + + for group in groups: + new_col = f"{group}{new_cols_tag}" + new_cols.append(new_col) + periods_df_group = periods_df.loc[ + periods_df[grouping_col] == group] + labels_df = add_periods_dummy_column_for_one_group( + labels_df=labels_df, + periods_df=periods_df_group, + new_col=new_col, + label=group) + + # Plotting line segments for each period. + # The line segments will have the same color for the same group. + # Each group will be occupying one horizontal level in the plot. + # The groups are stacked vertically in the plot so that the user can + # scan and compare the periods across groups. + fig = plot_lines_markers( + df=labels_df, + x_col=TIME_COL, + line_cols=None, + marker_cols=new_cols, + line_colors=None, + marker_colors=marker_colors) + + # Specify the y axis range + fig.update_yaxes(range=[-1, len(groups)]) + + # We add rectangles for each event period. + # The rectangle colors for each group will be the same and consistent with + # the line segments generated before for the same group. + # The rectangles span all the way through the y-axis so that the user can + # inspect the intersections between various groups better. 
+ shapes = [] + for i, group in enumerate(groups): + ind = (periods_df[grouping_col] == group) + periods_df_group = periods_df.loc[ind] + + fillcolor = marker_colors[i] + for j, row in periods_df_group.iterrows(): + x0 = row[start_time_col] + x1 = row[end_time_col] + y0 = -1 + y1 = len(groups) + + # Specify the corners of the rectangles + shape = dict( + type="rect", + x0=x0, + y0=y0, + x1=x1, + y1=y1, + fillcolor=fillcolor, + opacity=0.6, + line_width=2, + line_color=fillcolor, + layer="below") + + shapes.append(shape) + + fig.update_layout( + shapes=shapes, + title=title, + plot_bgcolor="rgba(233, 233, 233, 0.3)") # light grey (lighter than default background) + + return { + "fig": fig, + "labels_df": labels_df, + "groups": groups, + "new_cols": new_cols, + "ts": ts, + "min_timestamp": min_timestamp, + "max_timestamp": max_timestamp, + "marker_colors": marker_colors, + "periods_df": periods_df + } + + +def add_multi_vrects( + fig, + periods_df, + grouping_col=None, + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL, + y_min=None, + y_max=None, + annotation_text_col=None, + annotation_position="top left", + opacity=0.5, + grouping_color_dict=None): + """Adds vertical rectangle shadings to existing figure. + Each vertical rectangle information is given in rows of data frame ``periods_df``. + The information includes the beginning and end of vertical ranges of each rectangle as well as optional + annotation. + Rectangle colors can be grouped using a grouping given in ``grouping_col`` if available. + + Parameters + ---------- + fig : `plotly.graph_objects.Figure` + Existing plotly object which is going to be augmented with vertical + rectangles and annotations. + periods_df : `pandas.DataFrame` + Data frame with at least ``start_time_col`` and ``end_time_col`` to denote the + beginning and end of each vertical rectangle. + This might also include ``grouping_col`` if the rectangle colors are to be grouped + into same color within the group. 
+ grouping_col : `str` or None, default None + The column which is used for grouping the vertical rectangle colors. + For each group, the same color will be used for all periods in that group. + If None, a dummy column will be added ("metric") with a single value (also "metric") + which results in generating only one color for all rectangles. + start_time_col : `str`, default ``START_TIME_COL`` + The column denoting the start of a period. The type can be any type + consistent with the type of existing x axis in ``fig``. + end_time_col : `str`, default ``END_TIME_COL`` + The column denoting the end of a period. The type can be any type + consistent with the type of existing x axis in ``fig``. + y_min : `float` or None, default None + The lower limit of the rectangles. + y_max : `float` or None, default None + The upper limit of the rectangles. + annotation_text_col : `str` or None, default None + A column which includes annotation texts for each vertical rectangle. + annotation_position : `str`, default "top left" + The position of annotation texts with respect to the vertical rectangle. + opacity : `float`, default 0.5 + The opacity of the colors. Note that the passed colors could have opacity + as well, in which case this opacity will act as a relative opacity. + grouping_color_dict : `dict` [`str`, `str`] or None, default None + A dictionary to specify colors for each group given in ``grouping_col``. + If there is no ``grouping_col`` passed, there will be only one color needed + and in that case a dummy ``grouping_col`` will be created with the name "metric". + Therefore user needs to specify ``grouping_color_dict = {"metric": desired_color}``. + + Returns + ------- + result : `dict` + - "fig": `plotly.graph_objects.Figure` + Updated plotly object which is augmented with vertical + rectangles and annotations. + - "grouping_color_dict": `dict` [`str`, `str`] + A dictionary with keys being the groups and the values being the colors. 
+ + """ + # If there is no grouping column we add a grouping column with only one value ("metric") + if grouping_col is None: + grouping_col = "metric" + periods_df["metric"] = "metric" + + if start_time_col not in periods_df.columns: + raise ValueError( + f"start_time_col: {start_time_col} is not found in ``periods_df`` columns: {periods_df.columns}") + + if end_time_col not in periods_df.columns: + raise ValueError( + f"end_time_col: {end_time_col} is not found in ``periods_df`` columns: {periods_df.columns}") + + if grouping_col is not None and grouping_col not in periods_df.columns: + raise ValueError( + f"grouping_col: {grouping_col} is passed but not found in ``periods_df`` columns: {periods_df.columns}") + + if annotation_text_col is not None and annotation_text_col not in periods_df.columns: + raise ValueError( + f"annotation_text_col: {annotation_text_col} is passed but not found in ``periods_df`` columns: {periods_df.columns}") + + groups = list(set(periods_df[grouping_col])) + groups.sort() + + if grouping_color_dict is None: + colors = get_distinct_colors( + len(groups), + opacity=1.0) + grouping_color_dict = {groups[i]: colors[i] for i in range(len(groups))} + + for i, group in enumerate(groups): + ind = (periods_df[grouping_col] == group) + periods_df_group = periods_df.loc[ind] + + fillcolor = grouping_color_dict[group] + for j, row in periods_df_group.iterrows(): + x0 = row[start_time_col] + x1 = row[end_time_col] + + if annotation_text_col is not None: + annotation_text = row[annotation_text_col] + else: + annotation_text = "" + # Adds the vertical rectangles + fig.add_vrect( + x0=x0, + y0=y_min, + x1=x1, + y1=y_max, + fillcolor=fillcolor, + opacity=opacity, + line_width=2, + line_color=fillcolor, + layer="below", + annotation_text=annotation_text, + annotation_position=annotation_position) + return { + "fig": fig, + "grouping_color_dict": grouping_color_dict} + + +def plot_overlay_anomalies_multi_metric( + df, + time_col, + value_cols, + 
anomaly_df, + anomaly_df_grouping_col=None, + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL, + annotation_text_col=None, + annotation_position="top left", + lines_opacity=0.6, + markers_opacity=0.8, + vrect_opacity=0.3): + """This function operates on a given data frame (``df``) which includes time (given in ``time_col``) and + metrics (given in ``value_cols``), as well as ``anomaly_df`` which includes the anomaly periods + corresponding to those metrics. It generates a plot of the metrics annotated with anomaly + values as markers on the curves and vertical rectangles for the same periods. + Each metric, its anomaly values and vertical rectangles use the same color with + varying opacity. + + Parameters + ---------- + df : `pandas.DataFrame` + A data frame which at least includes a timestamp column (``TIME_COL``) and + ``value_cols`` which represent the metrics. + time_col : `str` + The column name in ``df`` representing time for the time series data. + The time column can be anything that can be parsed by `pandas.DatetimeIndex`. + value_cols : `list` [`str`] + The columns which include the metrics. + anomaly_df : `pandas.DataFrame` + Data frame with ``start_time_col`` and ``end_time_col`` and ``grouping_col`` + (if provided). This contains the anomaly periods for each metric + (one of the ``value_cols``). Each row of this dataframe corresponds + to an anomaly occurring between the times given in ``row[start_time_col]`` + and ``row[end_time_col]``. + The ``grouping_col`` (if not None) determines which metric that + anomaly corresponds to (otherwise we assume all anomalies apply to all metrics). + anomaly_df_grouping_col : `str` or None, default None + The column name for grouping the list of the anomalies which is to appear + in ``anomaly_df``. + This column should include some of the metric names + specified in ``value_cols``. 
The ``grouping_col`` (if not None) determines which metric that + anomaly corresponds to (otherwise we assume all anomalies apply to all metrics). + start_time_col : `str`, default ``START_TIME_COL`` + The column name in ``anomaly_df`` representing the start timestamp of + the anomalous period, inclusive. + The format can be anything that can be parsed by pandas DatetimeIndex. + end_time_col : `str`, default ``END_TIME_COL`` + The column name in ``anomaly_df`` representing the end timestamp of + the anomalous period, inclusive. + The format can be anything that can be parsed by pandas DatetimeIndex. + annotation_text_col : `str` or None, default None + A column which includes annotation texts for each vertical rectangle. + annotation_position : `str`, default "top left" + The position of annotation texts with respect to the vertical rectangle. + lines_opacity : `float`, default 0.6 + The opacity of the colors used in the lines (curves) which represent the + metrics given in ``value_cols``. + markers_opacity: `float`, default 0.8 + The opacity of the colors used in the markers which represent the + value of the metrics given in ``value_cols`` during anomaly times as + specified in ``anomaly_df``. + vrect_opacity : `float`, default 0.3 + The opacity of the colors for the vertical rectangles. + + + Returns + ------- + result : `dict` + A dictionary with following items: + + - "fig": `plotly.graph_objects.Figure` + Plotly object which includes the metrics augmented with vertical + rectangles and annotations. + - "augmented_df": `pandas.DataFrame` + This is a dataframe obtained by augmenting the input ``df`` with new + columns determining if the metrics appearing in ``df`` are anomaly + or not and the new columns denoting anomaly values and normal values + (described below). + - "is_anomaly_cols": `list` [`str`] + The list of added boolean columns to determine if a value is an anomaly for + a given metric. The format of the columns is ``f"{metric}_is_anomaly"``. 
+ - "anomaly_value_cols": `list` [`str`] + The list of columns containing only anomaly values (`np.nan` otherwise) for each corresponding + metric. The format of the columns is ``f"{metric}_anomaly_value"``. + - "normal_value_cols": `list` [`str`] + The list of columns containing only non-anomalous / normal values (`np.nan` otherwise) + for each corresponding metric. The format of the columns is ``f"{metric}_normal_value"``. + - "line_colors": `list` [`str`] + The colors generated for the metric lines (curves). + - "marker_colors": `list` [`str`] + The colors generated for the anomaly values markers. + - "vrect_colors": `list` [`str`] + The colors generated for the vertical rectangles. + + """ + # Adds anomaly information columns to the data + # For every column specified in ``value_cols``, 3 new columns are added to ``df``: + # ``f"{value_col}_is_anomaly"`` + # ``f"{value_col}_anomaly_value"`` + # ``f"{value_col}_normal_value"`` + augmenting_data_res = label_anomalies_multi_metric( + df=df, + time_col=time_col, + value_cols=value_cols, + anomaly_df=anomaly_df, + anomaly_df_grouping_col=anomaly_df_grouping_col, + start_time_col=start_time_col, + end_time_col=end_time_col) + + augmented_df = augmenting_data_res["augmented_df"] + is_anomaly_cols = augmenting_data_res["is_anomaly_cols"] + anomaly_value_cols = augmenting_data_res["anomaly_value_cols"] + normal_value_cols = augmenting_data_res["normal_value_cols"] + + line_colors = get_distinct_colors( + len(value_cols), + opacity=lines_opacity) + + marker_colors = get_distinct_colors( + len(value_cols), + opacity=markers_opacity) + + vrect_colors = get_distinct_colors( + len(value_cols), + opacity=vrect_opacity) + + fig = plot_lines_markers( + df=augmented_df, + x_col=time_col, + line_cols=value_cols, + marker_cols=anomaly_value_cols, + line_colors=line_colors, + marker_colors=marker_colors) + + grouping_color_dict = {value_cols[i]: vrect_colors[i] for i in range(len(value_cols))} + + y_min = 
df[value_cols].min(numeric_only=True).min() + y_max = df[value_cols].max(numeric_only=True).max() + + augmenting_fig_res = add_multi_vrects( + fig=fig, + periods_df=anomaly_df, + grouping_col=anomaly_df_grouping_col, + start_time_col=start_time_col, + end_time_col=end_time_col, + y_min=y_min, + y_max=y_max, + annotation_text_col=annotation_text_col, + annotation_position=annotation_position, + opacity=1.0, + grouping_color_dict=grouping_color_dict) + + fig = augmenting_fig_res["fig"] + + return { + "fig": fig, + "augmented_df": augmented_df, + "is_anomaly_cols": is_anomaly_cols, + "anomaly_value_cols": anomaly_value_cols, + "normal_value_cols": normal_value_cols, + "line_colors": line_colors, + "marker_colors": marker_colors, + "vrect_colors": vrect_colors + } + + +def plot_precision_recall_curve( + df, + grouping_col=None, + recall_col="recall", + precision_col="precision", + axis_font_size=18, + title_font_size=20, + title="Precision - Recall Curve", + opacity=0.95): + """Plots a Precision - Recall curve, where the x axis is recall and the y axis is precision. + If ``grouping_col`` is None, it creates one Precision - Recall curve given the data in ``df``. + Otherwise, this function creates an overlay plot for multiple Precision - Recall curves, one for each level in the ``grouping_col``. + + Parameters + ---------- + df : `pandas.DataFrame` + The input dataframe. Must contain the columns: + + - ``recall_col``: `float` + - ``precision_col``: `float` + + If ``grouping_col`` is not None, it must also contain the column ``grouping_col``. + grouping_col : `str` or None, default None + Column name for the grouping column. + recall_col : `str`, default "recall" + Column name for recall. + precision_col : `str`, default "precision" + Column name for precision. + axis_font_size : `int`, default 18 + Axis font size. + title_font_size : `int`, default 20 + Title font size. + title : `str`, default "Precision - Recall Curve" + Plot title. 
+ opacity : `float`, default 0.95 + The opacity of the color. This has to be a number between 0 and 1. + + Returns + ------- + fig : `plotly.graph_objs._figure.Figure` + Plot figure. + """ + if any([col not in df.columns for col in [recall_col, precision_col]]): + raise ValueError(f"`df` must contain the `recall_col`: '{recall_col}' and the `precision_col`: '{precision_col}' specified!") + # Stores the curves to be plotted. + data = [] + # Creates the curve(s). + if grouping_col is None: # Creates one precision - recall curve. + num_colors = 1 + df.sort_values(recall_col, inplace=True) + line = go.Scatter( + x=df[recall_col].tolist(), + y=df[precision_col].tolist()) + data.append(line) + else: # Creates precision - recall curve for every level in `grouping_col`. + if grouping_col not in df.columns: + raise ValueError(f"`grouping_col` = '{grouping_col}' is not found in the columns of `df`!") + num_colors = 0 + for level, indices in df.groupby(grouping_col).groups.items(): + df_subset = df.loc[indices].reset_index(drop=True).sort_values(recall_col) + line = go.Scatter( + name=f"{level}", + x=df_subset[recall_col].tolist(), + y=df_subset[precision_col].tolist()) + data.append(line) + num_colors += 1 + # Creates a list of colors for the curve(s). + color_list = get_distinct_colors( + num_colors=num_colors, + opacity=opacity) + if color_list is not None: + if len(color_list) < len(data): + raise ValueError("`color_list` must not be shorter than the number of traces in this figure!") + for i, v in enumerate(data): + v.line.color = color_list[i] + # Creates the layout. + range_epsilon = 0.05 # Space at the beginning and end of the margins. + layout = go.Layout( + xaxis=dict( + title=recall_col.title(), + titlefont=dict(size=axis_font_size), + range=[0 - range_epsilon, 1 + range_epsilon], # Sets the range of xaxis. + tickfont_size=axis_font_size, + tickformat=".0%", + hoverformat=",.1%"), # Keeps 1 decimal place. 
+ yaxis=dict( + title=precision_col.title(), + titlefont=dict(size=axis_font_size), + range=[0 - range_epsilon, 1 + range_epsilon], # Sets the range of yaxis. + tickfont_size=axis_font_size, + tickformat=".0%", + hoverformat=",.1%"), # Keeps 1 decimal place. + title=title.title(), + title_x=0.5, + titlefont=dict(size=title_font_size), + autosize=False, + width=1000, + height=800) + # Creates the figure. + fig = go.Figure(data=data, layout=layout) + fig.update_yaxes( + constrain="domain", # Compresses the yaxis by decreasing its "domain". + automargin=True, + rangemode="tozero") + fig.update_xaxes( + constrain="domain", # Compresses the xaxis by decreasing its "domain". + automargin=True, + rangemode="tozero") + fig.add_hline(y=0.0, line_width=1, line_color="gray") + fig.add_vline(x=0.0, line_width=1, line_color="gray") + return fig + + +def plot_anomalies_over_forecast_vs_actual( + df, + time_col=TIME_COL, + actual_col=ACTUAL_COL, + predicted_col=PREDICTED_COL, + predicted_anomaly_col=PREDICTED_ANOMALY_COL, + anomaly_col=ANOMALY_COL, + marker_opacity=0.7, + predicted_anomaly_marker_color="green", + anomaly_marker_color="red", + **kwargs): + """Utility function which overlays the predicted anomalies or anomalies on the forecast vs actual plot. + The function calls the internal function `~greykite.common.viz.timeseries_plotting.plot_forecast_vs_actual` + and then adds markers on top. + + Parameters + ---------- + df : `pandas.DataFrame` + The input dataframe. + time_col : `str`, default `~greykite.common.constants.TIME_COL` + Column in ``df`` with timestamp (x-axis). + actual_col : `str`, default `~greykite.common.constants.ACTUAL_COL` + Column in ``df`` with actual values. + predicted_col : `str`, default `~greykite.common.constants.PREDICTED_COL` + Column in ``df`` with predicted values. + predicted_anomaly_col : `str` or None, default `~greykite.common.constants.PREDICTED_ANOMALY_COL` + Column in ``df`` with predicted anomaly labels (boolean) in the time series. 
+ `True` denotes a predicted anomaly. + anomaly_col : `str` or None, default `~greykite.common.constants.ANOMALY_COL` + Column in ``df`` with anomaly labels (boolean) in the time series. + `True` denotes an anomaly. + marker_opacity : `float`, default 0.7 + The opacity of the marker colors. + predicted_anomaly_marker_color : `str`, default "green" + The color of the marker(s) for the predicted anomalies. + anomaly_marker_color : `str`, default "red" + The color of the marker(s) for the anomalies. + **kwargs + Additional arguments on how to decorate your plot. + The keyword arguments are passed to `~greykite.common.viz.timeseries_plotting.plot_forecast_vs_actual`. + + Returns + ------- + fig : `plotly.graph_objs._figure.Figure` + Plot figure. + """ + fig = plot_forecast_vs_actual( + df=df, + time_col=time_col, + actual_col=actual_col, + predicted_col=predicted_col, + **kwargs) + if predicted_anomaly_col is not None: + fig.add_trace(go.Scatter( + x=df.loc[df[predicted_anomaly_col].apply(lambda val: val is True), time_col], + y=df.loc[df[predicted_anomaly_col].apply(lambda val: val is True), predicted_col], + mode="markers", + marker=go.scatter.Marker(color=predicted_anomaly_marker_color), + name=predicted_anomaly_col.title(), + showlegend=True, + opacity=marker_opacity)) + if anomaly_col is not None: + fig.add_trace(go.Scatter( + x=df.loc[df[anomaly_col].apply(lambda val: val is True), time_col], + y=df.loc[df[anomaly_col].apply(lambda val: val is True), actual_col], + mode="markers", + marker=go.scatter.Marker(color=anomaly_marker_color), + name=anomaly_col.title(), + showlegend=True, + opacity=marker_opacity)) + return fig diff --git a/greykite/common/viz/timeseries_plotting.py b/greykite/common/viz/timeseries_plotting.py index 62d8034..ccb3916 100644 --- a/greykite/common/viz/timeseries_plotting.py +++ b/greykite/common/viz/timeseries_plotting.py @@ -27,6 +27,7 @@ import pandas as pd import plotly.graph_objects as go from plotly.colors import DEFAULT_PLOTLY_COLORS 
+from plotly.subplots import make_subplots from greykite.common import constants as cst from greykite.common.features.timeseries_features import build_time_features_df @@ -34,6 +35,7 @@ from greykite.common.logging import log_message from greykite.common.python_utils import update_dictionary from greykite.common.viz.colors_utils import get_color_palette +from greykite.common.viz.colors_utils import get_distinct_colors def plot_multivariate( @@ -1031,3 +1033,224 @@ def flexible_grouping_evaluation( f"{list_cols}.") return df_transformed + + +def plot_dual_axis_figure( + df, + x_col, + y_left_col, + y_right_col, + grouping_col=None, + xlabel=None, + ylabel_left=None, + ylabel_right=None, + title=None, + y_left_linestyle="solid", + y_right_linestyle="dash", + opacity=0.9, + axis_font_size=18, + title_font_size=20, + x_range=None, + y_left_range=None, + y_right_range=None, + x_tick_format=None, + y_left_tick_format=None, + y_right_tick_format=None, + x_hover_format=None, + y_left_hover_format=None, + y_right_hover_format=None, + group_color_dict=None): + """Generic function to plot a dual y-axis plot. The x-axis is specified by ``x_col``. + The left and right y-axes are specified by ``y_left_col`` and ``y_right_col`` respectively. + If ``grouping_col`` is specified, then multiple pairs of curves are drawn, one for each level in ``grouping_col``. + + Parameters + ---------- + df : `pandas.DataFrame` + The input dataframe. Must contain the columns ``x_col``, ``y_left_col`` and ``y_right_col``. + If ``grouping_col`` is not None, it must also contain the ``grouping_col`` column. + For example, the dataframe could look like this. 
+ + +-----------+----------------+-----------------+------------------+ + | ``x_col`` | ``y_left_col`` | ``y_right_col`` | ``grouping_col`` | + +===========+================+=================+==================+ + | 1.10 | 20.12 | 0.21 | "A" | + +-----------+----------------+-----------------+------------------+ + | 1.40 | 40.31 | 0.43 | "A" | + +-----------+----------------+-----------------+------------------+ + | 1.23 | 63.21 | NaN | "B" | + +-----------+----------------+-----------------+------------------+ + | 1.54 | 10.31 | 0.12 | "B" | + +-----------+----------------+-----------------+------------------+ + | ... | ... | ... | ... | + +-----------+----------------+-----------------+------------------+ + + x_col : `str` + The column name of the column in ``df`` to be used for the x-axis. + y_left_col : `str` + The column name of the column in ``df`` to be used for the left y-axis. + y_right_col : `str` + The column name of the column in ``df`` to be used for the right y-axis. + grouping_col : `str` or None, default None + Name of the grouping column in ``df`` to be used for overlaying curves for each level in ``grouping_col``. + xlabel : `str` or None, default None + Name for the x-axis label. If it is `None`, then it is set to be ``x_col``. + ylabel_left : `str` or None, default None + Name for the left y-axis label. If it is `None`, then it is set to be ``y_left_col``. + ylabel_right : `str` or None, default None + Name for the right y-axis label. If it is `None`, then it is set to be ``y_right_col``. + title : `str` or None, default None + The title for the plot. + y_left_linestyle : `str`, default "solid" + Line style for the left y-axis curve. + y_right_linestyle : `str`, default "dash" + Line style for the right y-axis curve. + opacity : `float`, default 0.9 + The opacity of the colors. This has to be a number between 0 and 1. + axis_font_size : `int`, default 18 + The size of the axis fonts. 
+ title_font_size : `int`, default 20 + The size of the title font. + x_range : `list` or None, default None + Range of the x-axis. + y_left_range : `list` or None, default None + Range of the left y-axis. + y_right_range : `list` or None, default None + Range of the right y-axis. + x_tick_format : `str` or None, default None + Format of the ticks on the x-axis. + y_left_tick_format : `str` or None, default None + Format of the ticks on the left y-axis. + y_right_tick_format : `str` or None, default None + Format of the ticks on the right y-axis. + x_hover_format : `str` or None, default None + Format of the x-axis values shown on hover. + y_left_hover_format : `str` or None, default None + Format of the left y-axis values shown on hover. + y_right_hover_format : `str` or None, default None + Format of the right y-axis values shown on hover. + group_color_dict : `dict` [`str`, `str`] or None, default None + Dictionary mapping each level in ``grouping_col`` to a color. + The keys are the levels in ``grouping_col`` and the values are the corresponding colors. + If ``group_color_dict`` is `None`, the colors are generated using the function + `greykite.common.viz.colors_utils.get_distinct_colors`. + + Returns + ------- + fig : `plotly.graph_objects.Figure` + Dual y-axes plot. + """ + if any([col not in df.columns for col in [x_col, y_left_col, y_right_col]]): + raise ValueError(f"`df` must contain the columns: '{x_col}', '{y_left_col}' and '{y_right_col}'!") + + # If no custom labels are given, we simply use the names of the passed columns. + if xlabel is None: + xlabel = x_col + if ylabel_left is None: + ylabel_left = y_left_col + if ylabel_right is None: + ylabel_right = y_right_col + # Stores the data for the left and right curves. + y_left_data = [] + y_right_data = [] + # Creates the curve(s). + if grouping_col is None: # No `grouping_col` + # In this case, only one color is needed. 
+ color = get_distinct_colors(num_colors=1, opacity=opacity)[0] + df = df.reset_index(drop=True).sort_values(x_col) + # Left lines. + line_left = go.Scatter( + name=ylabel_left, + x=df[x_col].tolist(), + y=df[y_left_col].tolist(), + showlegend=True, + line=dict( + dash=y_left_linestyle, + color=color)) + y_left_data.append(line_left) + # Right lines. + line_right = go.Scatter( + name=ylabel_right, + x=df[x_col].tolist(), + y=df[y_right_col].tolist(), + showlegend=True, + line=dict( + dash=y_right_linestyle, + color=color)) + y_right_data.append(line_right) + else: # `grouping_col` is not None. + # Gets the levels for the specified `grouping_col`. + levels = df.groupby(grouping_col).groups + # Assigns colors to levels if not specified. + if group_color_dict is None: + color_list = get_distinct_colors( + num_colors=len(levels), + opacity=opacity) + group_color_dict = {level: color_list[i] for i, level in enumerate(levels.keys())} + # Generates curves for each level. + for level, indices in levels.items(): + df_subset = df.loc[indices].reset_index(drop=True).sort_values(x_col) + # Left lines. + line_left = go.Scatter( + name=ylabel_left, + legendgroup=f"{grouping_col} = {level}", + legendgrouptitle_text=f"{grouping_col} = {level}", + x=df_subset[x_col].tolist(), + y=df_subset[y_left_col].tolist(), + showlegend=True, + line=dict( + dash=y_left_linestyle, + color=group_color_dict[level])) + y_left_data.append(line_left) + # Right lines. 
+ line_right = go.Scatter( + name=ylabel_right, + legendgroup=f"{grouping_col} = {level}", + legendgrouptitle_text=f"{grouping_col} = {level}", + x=df_subset[x_col].tolist(), + y=df_subset[y_right_col].tolist(), + showlegend=True, + line=dict( + dash=y_right_linestyle, + color=group_color_dict[level])) + y_right_data.append(line_right) + + fig = make_subplots(specs=[[{"secondary_y": True}]]) + for line_left, line_right in zip(y_left_data, y_right_data): + fig.add_trace(line_left, secondary_y=False) + fig.add_trace(line_right, secondary_y=True) + # Updates figure layout. + fig.update_layout( + title_text=title, + titlefont=dict(size=title_font_size), + autosize=False, + width=1000, + height=800, + hovermode="x") + # Updates x-axis. + fig.update_xaxes( + title=xlabel, + titlefont=dict(size=axis_font_size), + range=x_range, + tickfont_size=axis_font_size, + tickformat=x_tick_format, + hoverformat=x_hover_format) + # Updates the left y-axis. + fig.update_yaxes( + title_text=ylabel_left, + secondary_y=False, + titlefont=dict(size=axis_font_size), + range=y_left_range, + tickfont_size=axis_font_size, + tickformat=y_left_tick_format, + hoverformat=y_left_hover_format) + # Updates the right y-axis. 
+ fig.update_yaxes( + title_text=ylabel_right, + secondary_y=True, + titlefont=dict(size=axis_font_size), + range=y_right_range, + tickfont_size=axis_font_size, + tickformat=y_right_tick_format, + hoverformat=y_right_hover_format) + return fig diff --git a/greykite/framework/input/univariate_time_series.py b/greykite/framework/input/univariate_time_series.py index 3faf4c2..2d7f5a7 100644 --- a/greykite/framework/input/univariate_time_series.py +++ b/greykite/framework/input/univariate_time_series.py @@ -119,7 +119,7 @@ def __init__(self) -> None: self.last_date_for_val: Optional[datetime] = None self.last_date_for_reg: Optional[datetime] = None self.last_date_for_lag_reg: Optional[datetime] = None - self.train_end_date: Optional[str] = None + self.train_end_date: Optional[Union[str, datetime]] = None self.fit_cols: List[str] = [] self.fit_df: Optional[pd.DataFrame] = None self.fit_y: Optional[pd.DataFrame] = None @@ -132,12 +132,12 @@ def load_data( df: pd.DataFrame, time_col: str = TIME_COL, value_col: str = VALUE_COL, - freq: str = None, - date_format: str = None, - tz: str = None, - train_end_date: datetime = None, - regressor_cols: List[str] = None, - lagged_regressor_cols: List[str] = None, + freq: Optional[str] = None, + date_format: Optional[str] = None, + tz: Optional[str] = None, + train_end_date: Optional[Union[str, datetime]] = None, + regressor_cols: Optional[List[str]] = None, + lagged_regressor_cols: Optional[List[str]] = None, anomaly_info: Optional[Union[Dict, List[Dict]]] = None): """Loads data to internal representation. Parses date column, sets timezone aware index. @@ -162,7 +162,7 @@ def load_data( If None (recommended), inferred by `pandas.to_datetime`. tz : `str` or pytz.timezone object or None, default None Passed to `pandas.tz_localize` to localize the timestamp. - train_end_date : `datetime.datetime` or None, default None + train_end_date : `str` or `datetime.datetime` or None, default None Last date to use for fitting the model. 
Forecasts are generated after this date. If None, it is set to the minimum of ``self.last_date_for_val`` and ``self.last_date_for_reg``. @@ -227,7 +227,7 @@ def load_data( freq=freq, date_format=date_format, tz=tz, - train_end_date=train_end_date, + train_end_date=pd.to_datetime(train_end_date, format=date_format), regressor_cols=regressor_cols, lagged_regressor_cols=lagged_regressor_cols, anomaly_info=anomaly_info) diff --git a/greykite/framework/output/univariate_forecast.py b/greykite/framework/output/univariate_forecast.py index 936da4d..3bd8f98 100644 --- a/greykite/framework/output/univariate_forecast.py +++ b/greykite/framework/output/univariate_forecast.py @@ -35,6 +35,7 @@ from greykite.common.evaluation import fraction_outside_tolerance from greykite.common.evaluation import r2_null_model_score from greykite.common.python_utils import apply_func_to_columns +from greykite.common.time_properties import infer_freq from greykite.common.viz.timeseries_plotting import add_groupby_column from greykite.common.viz.timeseries_plotting import flexible_grouping_evaluation from greykite.common.viz.timeseries_plotting import grouping_evaluation @@ -169,7 +170,7 @@ def __init__( if test_start_date is None: # This expects no gaps in time column - inferred_freq = pd.infer_freq(self.df[self.time_col]) + inferred_freq = infer_freq(self.df, self.time_col) # Uses pd.date_range because pd.Timedelta does not work for complicated frequencies e.g. 
"W-MON" self.test_start_date = pd.date_range( start=self.train_end_date, diff --git a/greykite/framework/pipeline/utils.py b/greykite/framework/pipeline/utils.py index 34e0e97..e365cf4 100644 --- a/greykite/framework/pipeline/utils.py +++ b/greykite/framework/pipeline/utils.py @@ -775,7 +775,7 @@ def get_forecast( # combines actual with predictions union_df = pd.DataFrame({ - xlabel: df[cst.TIME_COL].values, + cst.TIME_COL: df[cst.TIME_COL].values, # .values here, since df and predicted_df have different indexes cst.ACTUAL_COL: df[cst.VALUE_COL].values, # evaluation and plots are done on the values *before* any transformations @@ -808,7 +808,7 @@ def get_forecast( return UnivariateForecast( union_df, - time_col=xlabel, + time_col=cst.TIME_COL, actual_col=cst.ACTUAL_COL, predicted_col=cst.PREDICTED_COL, predicted_lower_col=predicted_lower_col, diff --git a/greykite/framework/templates/auto_model_template.py b/greykite/framework/templates/auto_model_template.py index d7f8757..f1b592b 100644 --- a/greykite/framework/templates/auto_model_template.py +++ b/greykite/framework/templates/auto_model_template.py @@ -28,6 +28,7 @@ from greykite.common.logging import LoggingLevelEnum from greykite.common.logging import log_message +from greykite.common.time_properties import infer_freq from greykite.common.time_properties import min_gap_in_seconds from greykite.framework.pipeline.utils import get_default_time_parameters from greykite.framework.templates.autogen.forecast_config import ForecastConfig @@ -86,8 +87,9 @@ def get_auto_silverkite_model_template( metadata = config.metadata_param freq = metadata.freq if freq is None: - freq = pd.infer_freq( - df[metadata.time_col] + freq = infer_freq( + df, + metadata.time_col ) if freq is None: # NB: frequency inference fails if there are missing points in the input data diff --git a/greykite/framework/templates/autogen/forecast_config.py b/greykite/framework/templates/autogen/forecast_config.py index 8a14bc8..518aaf6 100644 --- 
a/greykite/framework/templates/autogen/forecast_config.py +++ b/greykite/framework/templates/autogen/forecast_config.py @@ -7,6 +7,7 @@ # # result = forecast_config_from_dict(json.loads(json_string)) +import json from dataclasses import dataclass from typing import Any from typing import Callable @@ -18,6 +19,8 @@ from typing import Union from typing import cast +from greykite.common.python_utils import assert_equal + T = TypeVar("T") @@ -101,6 +104,20 @@ def to_class(c: Type[T], x: Any) -> dict: return cast(Any, x).to_dict() +def from_list_float(x: Any) -> List[float]: + """Parses a list of floats""" + assert isinstance(x, list) + assert all(isinstance(item, (float, int)) and not isinstance(item, bool) for item in x) + return x + + +def from_list_list_str(x: Any) -> List[List[str]]: + """Parses a list that contains lists of strings""" + assert isinstance(x, list) + assert all(from_list_str(item) for item in x) + return x + + @dataclass class ComputationParam: """How to compute the result.""" @@ -414,11 +431,13 @@ def from_dict(obj: Any) -> 'ForecastConfig': forecast_one_by_one = from_union([from_int, from_bool, from_none, from_list_int], obj.get("forecast_one_by_one")) metadata_param = from_union([MetadataParam.from_dict, from_none], obj.get("metadata_param")) if not isinstance(obj.get("model_components_param"), list): - obj["model_components_param"] = [obj.get("model_components_param")] - model_components_param = [from_union([ModelComponentsParam.from_dict, from_none], mcp) for mcp in obj.get("model_components_param")] + model_components_param = from_union([ModelComponentsParam.from_dict, from_none], obj.get("model_components_param")) + else: + model_components_param = [from_union([ModelComponentsParam.from_dict, from_none], mcp) for mcp in obj.get("model_components_param")] if not isinstance(obj.get("model_template"), list): - obj["model_template"] = [obj.get("model_template")] - model_template = [from_union([from_str, from_none], mt) for mt in 
obj.get("model_template")] + model_template = from_union([from_str, from_none], obj.get("model_template")) + else: + model_template = [from_union([from_str, from_none], mt) for mt in obj.get("model_template")] return ForecastConfig( computation_param=computation_param, coverage=coverage, @@ -447,6 +466,18 @@ def to_dict(self) -> dict: result["model_template"] = [from_union([from_str, from_none], mt) for mt in self.model_template] return result + @staticmethod + def from_json(obj: Any) -> 'ForecastConfig': + """Converts a json string to the corresponding instance of the `ForecastConfig` class. + Raises ValueError if the input is not a json string. + """ + try: + forecast_dict = json.loads(obj) + except Exception: + raise ValueError(f"The input ({obj}) is not a json string.") + + return ForecastConfig.from_dict(forecast_dict) + def forecast_config_from_dict(s: Any) -> ForecastConfig: return ForecastConfig.from_dict(s) @@ -454,3 +485,31 @@ def forecast_config_from_dict(s: Any) -> ForecastConfig: def forecast_config_to_dict(x: ForecastConfig) -> Any: return to_class(ForecastConfig, x) + + +def assert_equal_forecast_config( + forecast_config_1: ForecastConfig, + forecast_config_2: ForecastConfig): + """Asserts equality between two instances of `ForecastConfig`. + Raises an error in case of parameter mismatch. + + Parameters + ---------- + forecast_config_1: `ForecastConfig` + First instance of the + :class:`~greykite.framework.templates.model_templates.ForecastConfig` for comparing. + forecast_config_2: `ForecastConfig` + Second instance of the + :class:`~greykite.framework.templates.model_templates.ForecastConfig` for comparing. + + Raises + ------- + AssertionError + If `ForecastConfig`s do not match, else returns None. 
+ """ + if not isinstance(forecast_config_1, ForecastConfig): + raise ValueError(f"The input ({forecast_config_1}) is not a member of 'ForecastConfig' class.") + if not isinstance(forecast_config_2, ForecastConfig): + raise ValueError(f"The input ({forecast_config_2}) is not a member of 'ForecastConfig' class.") + + assert_equal(forecast_config_1.to_dict(), forecast_config_2.to_dict()) diff --git a/greykite/framework/templates/multistage_forecast_template_config.py b/greykite/framework/templates/multistage_forecast_template_config.py index dbeb4ee..8cea1ca 100644 --- a/greykite/framework/templates/multistage_forecast_template_config.py +++ b/greykite/framework/templates/multistage_forecast_template_config.py @@ -106,6 +106,8 @@ class MultistageForecastTemplateConfig: "holiday_post_num_days": 1, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "changepoints_dict": { @@ -145,7 +147,8 @@ class MultistageForecastTemplateConfig: "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) ), @@ -176,6 +179,8 @@ class MultistageForecastTemplateConfig: "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "changepoints_dict": None, @@ -207,7 +212,8 @@ class MultistageForecastTemplateConfig: "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) ) @@ -268,6 +274,8 @@ class MultistageForecastTemplateConfig: "holiday_post_num_days": 0, # ignored "holiday_pre_post_num_dict": None, # ignored "daily_event_df_dict": None, # ignored + "daily_event_neighbor_impact": None, # ignored + 
"daily_event_shifted_effect": None }, changepoints={ "auto_growth": True, @@ -300,7 +308,8 @@ class MultistageForecastTemplateConfig: "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) ), diff --git a/greykite/framework/templates/silverkite_template.py b/greykite/framework/templates/silverkite_template.py index 9e1cc4a..d584834 100644 --- a/greykite/framework/templates/silverkite_template.py +++ b/greykite/framework/templates/silverkite_template.py @@ -44,7 +44,6 @@ from greykite.framework.templates.autogen.forecast_config import ModelComponentsParam from greykite.framework.templates.base_template import BaseTemplate from greykite.sklearn.estimator.base_forecast_estimator import BaseForecastEstimator -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics from greykite.sklearn.estimator.silverkite_estimator import SilverkiteEstimator @@ -148,6 +147,8 @@ def apply_default_model_components( default_events = { "daily_event_df_dict": [None], + "daily_event_neighbor_impact": [None], + "daily_event_shifted_effect": [None] } model_components.events = update_dictionary( default_events, @@ -206,7 +207,6 @@ def apply_default_model_components( default_custom = { "silverkite": [SilverkiteForecast()], # NB: sklearn creates a copy in grid search - "silverkite_diagnostics": [SilverkiteDiagnostics()], # The same origin for every split, based on start year of full dataset. # To use first date of each training split, set to `None` in model_components. 
"origin_for_time_vars": [origin_for_time_vars], @@ -220,7 +220,8 @@ def apply_default_model_components( "min_admissible_value": [None], "max_admissible_value": [None], "regression_weight_col": [None], - "normalize_method": [None] + "normalize_method": [None], + "remove_intercept": [False] } model_components.custom = update_dictionary( default_custom, @@ -374,7 +375,7 @@ class SilverkiteTemplate(BaseTemplate): parameters in `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast.forecast`. Allowed keys: - ``"silverkite"``, ``"silverkite_diagnostics"``, + ``"silverkite"``, ``"origin_for_time_vars"``, ``"extra_pred_cols"``, ``"drop_pred_cols"``, ``"explicit_pred_cols"``, ``"fit_algorithm_dict"``, ``"min_admissible_value"``, @@ -593,7 +594,6 @@ def get_hyperparameter_grid(self): # returns a single set of parameters for grid search hyperparameter_grid = { "estimator__silverkite": self.config.model_components_param.custom["silverkite"], - "estimator__silverkite_diagnostics": self.config.model_components_param.custom["silverkite_diagnostics"], "estimator__origin_for_time_vars": self.config.model_components_param.custom["origin_for_time_vars"], "estimator__extra_pred_cols": self.config.model_components_param.custom["extra_pred_cols"], "estimator__drop_pred_cols": self.config.model_components_param.custom["drop_pred_cols"], @@ -602,6 +602,8 @@ def get_hyperparameter_grid(self): "estimator__training_fraction": [None], "estimator__fit_algorithm_dict": self.config.model_components_param.custom["fit_algorithm_dict"], "estimator__daily_event_df_dict": self.config.model_components_param.events["daily_event_df_dict"], + "estimator__daily_event_neighbor_impact": self.config.model_components_param.events["daily_event_neighbor_impact"], + "estimator__daily_event_shifted_effect": self.config.model_components_param.events["daily_event_shifted_effect"], "estimator__fs_components_df": self.config.model_components_param.seasonality["fs_components_df"], 
"estimator__autoreg_dict": self.config.model_components_param.autoregression["autoreg_dict"], "estimator__simulation_num": self.config.model_components_param.autoregression["simulation_num"], @@ -614,6 +616,7 @@ def get_hyperparameter_grid(self): "estimator__max_admissible_value": self.config.model_components_param.custom["max_admissible_value"], "estimator__normalize_method": self.config.model_components_param.custom["normalize_method"], "estimator__regression_weight_col": self.config.model_components_param.custom["regression_weight_col"], + "estimator__remove_intercept": self.config.model_components_param.custom["remove_intercept"], "estimator__uncertainty_dict": self.config.model_components_param.uncertainty["uncertainty_dict"], } @@ -630,7 +633,10 @@ def get_hyperparameter_grid(self): hyperparameter_grid = dictionaries_values_to_lists( hyperparameter_grid, hyperparameters_list_type={ - "estimator__extra_pred_cols": [None]} + "estimator__extra_pred_cols": [None], + "estimator__drop_pred_cols": [None], + "estimator__explicit_pred_cols": [None] + } ) return hyperparameter_grid diff --git a/greykite/framework/templates/simple_silverkite_template.py b/greykite/framework/templates/simple_silverkite_template.py index ea73f07..756f735 100644 --- a/greykite/framework/templates/simple_silverkite_template.py +++ b/greykite/framework/templates/simple_silverkite_template.py @@ -554,7 +554,6 @@ class SimpleSilverkiteTemplate(BaseTemplate): hyperparameter_override={ "estimator__silverkite": SimpleSilverkiteForecast(), - "estimator__silverkite_diagnostics": SilverkiteDiagnostics(), "estimator__growth_term": "linear", "input__response__null__impute_algorithm": "ts_interpolate", "input__response__null__impute_params": {"orders": [7, 14]}, @@ -845,7 +844,11 @@ def get_hyperparameter_grid(self): "estimator__holidays_to_model_separately": [None, "auto", self._silverkite_holiday.ALL_HOLIDAYS_IN_COUNTRIES], "estimator__holiday_lookup_countries": [None, "auto"], 
"estimator__regressor_cols": [None], - "estimator__extra_pred_cols": [None]} + "estimator__extra_pred_cols": [None], + "estimator__drop_pred_cols": [None], + "estimator__explicit_pred_cols": [None], + "estimator__daily_event_shifted_effect": [None] + } ) hyperparameter_grid_result = unique_dict_in_list(hyperparameter_grid_result) if len(hyperparameter_grid_result) == 1: @@ -1127,7 +1130,8 @@ def __get_single_model_components_param_from_template(self, template): "min_admissible_value": None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False }, autoregression=self.constants.COMMON_MODELCOMPONENTPARAM_PARAMETERS["AR"][components[components.index("AR")+1]], regressors={ @@ -1276,6 +1280,8 @@ def __get_hyperparameter_grid_from_model_components(model_components): "estimator__holiday_post_num_days": model_components.events["holiday_post_num_days"], "estimator__holiday_pre_post_num_dict": model_components.events["holiday_pre_post_num_dict"], "estimator__daily_event_df_dict": model_components.events["daily_event_df_dict"], + "estimator__daily_event_neighbor_impact": model_components.events["daily_event_neighbor_impact"], + "estimator__daily_event_shifted_effect": model_components.events["daily_event_shifted_effect"], "estimator__feature_sets_enabled": model_components.custom["feature_sets_enabled"], "estimator__fit_algorithm_dict": model_components.custom["fit_algorithm_dict"], "estimator__max_daily_seas_interaction_order": model_components.custom["max_daily_seas_interaction_order"], @@ -1292,6 +1298,7 @@ def __get_hyperparameter_grid_from_model_components(model_components): "estimator__regressor_cols": model_components.regressors["regressor_cols"], "estimator__lagged_regressor_dict": model_components.lagged_regressors["lagged_regressor_dict"], "estimator__regression_weight_col": model_components.custom["regression_weight_col"], + "estimator__remove_intercept": 
model_components.custom["remove_intercept"], "estimator__uncertainty_dict": model_components.uncertainty["uncertainty_dict"] } return hyperparameter_grid diff --git a/greykite/framework/templates/simple_silverkite_template_config.py b/greykite/framework/templates/simple_silverkite_template_config.py index dafcb74..380a855 100644 --- a/greykite/framework/templates/simple_silverkite_template_config.py +++ b/greykite/framework/templates/simple_silverkite_template_config.py @@ -513,6 +513,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 1, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, SP2={ "auto_holiday": False, @@ -522,6 +524,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, SP4={ "auto_holiday": False, @@ -531,6 +535,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 4, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, TG={ "auto_holiday": False, @@ -540,6 +546,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 3, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, NONE={ "auto_holiday": False, @@ -549,6 +557,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }), # Feature sets enabled. 
FEASET=dict( @@ -645,6 +655,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -687,7 +699,8 @@ class SimpleSilverkiteTemplateOptions: "min_admissible_value": None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) """Defines the ``SILVERKITE`` template. Contains automatic growth, @@ -726,6 +739,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -768,7 +783,8 @@ class SimpleSilverkiteTemplateOptions: "min_admissible_value": None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) @@ -792,6 +808,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -834,7 +852,8 @@ class SimpleSilverkiteTemplateOptions: "min_admissible_value": None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) @@ -858,6 +877,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -900,7 +921,8 @@ class SimpleSilverkiteTemplateOptions: "min_admissible_value": 
None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) @@ -925,6 +947,8 @@ class SimpleSilverkiteTemplateOptions: "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -974,7 +998,8 @@ class SimpleSilverkiteTemplateOptions: "min_admissible_value": None, "max_admissible_value": None, "regression_weight_col": None, - "normalize_method": "zero_to_one" + "normalize_method": "zero_to_one", + "remove_intercept": False } ) """Defines the ``SILVERKITE_MONTHLY`` template. Contains automatic growth. diff --git a/greykite/framework/utils/exploratory_data_analysis.py b/greykite/framework/utils/exploratory_data_analysis.py index 8c5ac05..c6806d3 100644 --- a/greykite/framework/utils/exploratory_data_analysis.py +++ b/greykite/framework/utils/exploratory_data_analysis.py @@ -25,6 +25,9 @@ from io import BytesIO import matplotlib.pyplot as plt +from statsmodels.graphics.tsaplots import plot_acf +from statsmodels.graphics.tsaplots import plot_pacf + from greykite.algo.changepoint.adalasso.changepoint_detector import ChangepointDetector from greykite.algo.common.holiday_inferrer import HolidayInferrer from greykite.common.constants import TIME_COL @@ -32,9 +35,6 @@ from greykite.common.enums import SeasonalityEnum from greykite.common.time_properties import min_gap_in_seconds from greykite.common.time_properties_forecast import get_simple_time_frequency_from_period -from statsmodels.graphics.tsaplots import plot_acf -from statsmodels.graphics.tsaplots import plot_pacf - from greykite.framework.input.univariate_time_series import UnivariateTimeSeries diff --git a/greykite/sklearn/estimator/auto_arima_estimator.py b/greykite/sklearn/estimator/auto_arima_estimator.py index fd3c890..ac62c90 100644 --- 
a/greykite/sklearn/estimator/auto_arima_estimator.py +++ b/greykite/sklearn/estimator/auto_arima_estimator.py @@ -34,6 +34,7 @@ from greykite.common.constants import PREDICTED_UPPER_COL from greykite.common.constants import TIME_COL from greykite.common.constants import VALUE_COL +from greykite.common.time_properties import infer_freq from greykite.sklearn.estimator.base_forecast_estimator import BaseForecastEstimator @@ -294,7 +295,7 @@ def predict(self, X, y=None): fut_reg_df = fut_df[self.regressor_cols] # Auto-arima only accepts regressor values beyond `fit_df` if self.freq is None: - self.freq = pd.infer_freq(self.fit_df[self.time_col_]) + self.freq = infer_freq(self.fit_df, self.time_col_) if self.freq == "MS": timedelta_freq = "M" # `to_period` does not recognize non-traditional frequencies else: diff --git a/greykite/sklearn/estimator/base_silverkite_estimator.py b/greykite/sklearn/estimator/base_silverkite_estimator.py index 4aa3d46..c40a56b 100644 --- a/greykite/sklearn/estimator/base_silverkite_estimator.py +++ b/greykite/sklearn/estimator/base_silverkite_estimator.py @@ -22,17 +22,25 @@ """sklearn estimator with common functionality between SilverkiteEstimator and SimpleSilverkiteEstimator. 
""" - +import re from typing import Dict from typing import Optional +from typing import Type import pandas as pd +import plotly.express as px from pandas.tseries.frequencies import to_offset +from plotly import graph_objects as go +from plotly.subplots import make_subplots from sklearn.exceptions import NotFittedError from sklearn.metrics import mean_squared_error +from greykite.algo.changepoint.adalasso.changepoints_utils import get_trend_changepoint_dates_from_cols from greykite.algo.common.col_name_utils import create_pred_category from greykite.algo.common.ml_models import breakdown_regression_based_prediction +from greykite.algo.common.model_summary import ModelSummary +from greykite.algo.forecast.silverkite.constants.silverkite_component import SilverkiteComponentsEnum +from greykite.algo.forecast.silverkite.constants.silverkite_constant import default_silverkite_constant from greykite.algo.forecast.silverkite.forecast_silverkite import SilverkiteForecast from greykite.algo.forecast.silverkite.forecast_silverkite_helper import get_silverkite_uncertainty_dict from greykite.common import constants as cst @@ -40,7 +48,6 @@ from greykite.common.time_properties import min_gap_in_seconds from greykite.common.time_properties_forecast import get_simple_time_frequency_from_period from greykite.sklearn.estimator.base_forecast_estimator import BaseForecastEstimator -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics from greykite.sklearn.uncertainty.uncertainty_methods import UncertaintyMethodEnum @@ -92,8 +99,6 @@ class BaseSilverkiteEstimator(BaseForecastEstimator): ---------- silverkite : Class or a derived class of `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast` The silverkite algorithm instance used for forecasting - silverkite_diagnostics : Class or a derived class of `~greykite.sklearn.estimator.silverkite_diagnostics.SilverkiteDiagnostics` - The silverkite class used for plotting and generating model 
summary. model_dict : `dict` or None A dict with fitted model and its attributes. The output of `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast.forecast`. @@ -144,43 +149,35 @@ class BaseSilverkiteEstimator(BaseForecastEstimator): - ``fs_components_df["seas_names"]`` (e.g. ``daily``, ``weekly``) is appended to the column names, if provided. - `~greykite.sklearn.estimator.silverkite_diagnostics.SilverkiteDiagnostics.plot_silverkite_components` groups - based on ``fs_components_df["seas_names"]`` passed to ``forecast_silverkite`` during fit. - E.g. any column containing ``daily`` is added to daily seasonality effect. The reason - is as follows: - - 1. User can provide ``tow`` and ``str_dow`` for weekly seasonality. - These should be aggregated, and we can do that only based on "seas_names". - 2. yearly and quarterly seasonality both use ``ct1`` as "names" column. - Only way to distinguish those effects is via "seas_names". - 3. ``ct1`` is also used for growth. If it is interacted with seasonality, the columns become - indistinguishable without "seas_names". + `~greykite.sklearn.estimator.base_silverkite_estimator.BaseSilverkiteEstimator.plot_components` relies + on a regular expression dictionary to group components together. There are two available in the library, see + `~greykite.common.constants` for the two definitions - Additionally, the function sets yaxis labels based on ``seas_names``: - ``daily`` as ylabel is much more informative than ``tod`` as ylabel in component plots. + 1. "DEFAULT_COMPONENTS_REGEX_DICT" + Grouped seasonality that is the default + 2. 
"DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT": + A detailed seasonality breakdown where the user can view daily/weekly/monthly/quarterly/yearly seasonality """ def __init__( self, silverkite: SilverkiteForecast = SilverkiteForecast(), - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics(), score_func: callable = mean_squared_error, coverage: float = None, null_model_params: Optional[Dict] = None, uncertainty_dict: Optional[Dict] = None): - # initializes null model + # Initializes null model super().__init__( score_func=score_func, coverage=coverage, null_model_params=null_model_params) - # required in subclasses __init__ + # Required in subclasses __init__ self.uncertainty_dict = uncertainty_dict - # set by `fit` + # Set by `fit` # fitted model in dictionary format returned from # the `forecast_silverkite` function self.silverkite: SilverkiteForecast = silverkite - self.silverkite_diagnostics: SilverkiteDiagnostics = silverkite_diagnostics self.model_dict = None self.pred_cols = None self.feature_cols = None @@ -200,13 +197,17 @@ def __init__( self._pred_category = None self.extra_pred_cols = None # all silverkite estimators should support this. - # set by the predict method + # Set by the predict method self.forecast = None - # set by predict method + # Set by predict method self.forecast_x_mat = None - # set by the summary method + # Set by the summary method self.model_summary = None + # Needed for diagnostics + self._silverkite_components_enum: Type[SilverkiteComponentsEnum] = default_silverkite_constant.get_silverkite_components_enum() + self.components = None + def __set_uncertainty_dict(self, X, time_col, value_col): """Checks if ``coverage`` is consistent with the ``uncertainty_dict`` used to train the ``forecast_silverkite`` model. 
Sets ``uncertainty_dict`` @@ -311,20 +312,14 @@ def finish_fit(self): self.pred_cols = self.model_dict["pred_cols"] self.feature_cols = self.model_dict["x_mat"].columns - # model coefficients + # Model coefficients if hasattr(self.model_dict["ml_model"], "coef_"): self.coef_ = pd.DataFrame( self.model_dict["ml_model"].coef_, index=self.feature_cols) - self._set_silverkite_diagnostics_params() return self - def _set_silverkite_diagnostics_params(self): - if self.silverkite_diagnostics is None: - self.silverkite_diagnostics = SilverkiteDiagnostics() - self.silverkite_diagnostics.set_params(self.pred_category, self.time_col_, self.value_col_) - def predict(self, X, y=None): """Creates forecast for the dates specified in ``X``. @@ -391,7 +386,7 @@ def predict(self, X, y=None): self.forecast = pred_df self.forecast_x_mat = x_mat - # renames columns to standardized schema + # Renames columns to standardized schema output_columns = { self.time_col_: cst.TIME_COL} if cst.PREDICTED_COL in pred_df.columns: @@ -445,7 +440,7 @@ def forecast_breakdown( ---------- grouping_regex_patterns_dict : `dict` {`str`: `str`} A dictionary with group names as keys and regexes as values. - This dictinary is used to partition the columns into various groups + This dictionary is used to partition the columns into various groups forecast_x_mat : `pd.DataFrame`, default None The dataframe of design matrix of regression model. If None, this will be extracted from the estimator. @@ -459,7 +454,7 @@ def forecast_breakdown( Therefore we simply create an integer index with size of ``forecast_x_mat``. center_components : `bool`, default False It determines if components should be centered at their mean and the mean - be added to the intercept. More concretely if a componet is "x" then it will + be added to the intercept. More concretely if a component is "x" then it will be mapped to "x - mean(x)"; and "mean(x)" will be added to the intercept so that the sum of the components remains the same. 
denominator : `str`, default None @@ -588,56 +583,313 @@ def get_max_ar_order(self): return max_order def summary(self, max_colwidth=20): - if self.silverkite_diagnostics is None: - self._set_silverkite_diagnostics_params() - return self.silverkite_diagnostics.summary(self.model_dict, max_colwidth) - - def plot_components(self, names=None, title=None): - if self.model_dict is None: - raise NotFittedError("Call `fit` before calling `plot_components`.") - if self.silverkite_diagnostics is None: - self._set_silverkite_diagnostics_params() - return self.silverkite_diagnostics.plot_components(self.model_dict, names, title) - - def plot_trend(self, title=None): - """Convenience function to plot the data and the trend component. + """Creates the model summary for the given model Parameters ---------- - title: `str`, optional, default `None` - Plot title. + max_colwidth : `int` + The maximum length for predictors to be shown in their original name. + If the maximum length of predictors exceeds this parameter, all + predictors name will be suppressed and only indices are shown. Returns ------- - fig: `plotly.graph_objects.Figure` - Figure. + model_summary: `ModelSummary` + The model summary for this model. See `~greykite.algo.common.model_summary.ModelSummary` """ - if title is None: - title = "Trend plot" - return self.plot_components(names=["trend"], title=title) - - def plot_seasonalities(self, title=None): - """Convenience function to plot the data and the seasonality components. 
+ if self.model_dict is not None: + self.model_summary = ModelSummary( + x=self.model_dict["x_mat"].values, + y=self.model_dict["y"].values, + pred_cols=list(self.model_dict["x_mat"].columns), + pred_category=self.pred_category, + fit_algorithm=self.model_dict["fit_algorithm"], + ml_model=self.model_dict["ml_model"], + max_colwidth=max_colwidth) + return self.model_summary + + def plot_components( + self, + grouping_regex_patterns_dict=None, + center_components=True, + denominator=None, + predict_phase=False, + title=None): + """Class method to plot the components of a ``Silverkite`` model on datasets passed to either + ``fit`` or ``predict``. Parameters ---------- - title: `str`, optional, default `None` - Plot title. + grouping_regex_patterns_dict : `dict`, optional, default None + If None, it is set to `~greykite.common.constants.DEFAULT_COMPONENTS_REGEX_DICT`. + An alternative dictionary is available that provides a more detailed breakdown of + seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.), See: + `~greykite.common.constants.DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT`. + center_components : `bool`, optional, default True + It determines if components should be centered at their mean and the mean + be added to the intercept. More concretely if a component is "x" then it will + be mapped to "x - mean(x)"; and "mean(x)" will be added to the intercept so + that the sum of the components remains the same. + See `~greykite.sklearn.estimator.base_silverkite_estimator.forecast_breakdown`. + denominator : `str`, optional, default None + If not None, it will specify a way to divide the components. There are + two options implemented: + + - "abs_y_mean" : `float` + The absolute value of the observed mean of the response + - "y_std" : `float` + The standard deviation of the observed response + See `~greykite.sklearn.estimator.base_silverkite_estimator.forecast_breakdown`. 
+ predict_phase: `bool`, optional, default False + If False, plots the components of the training data and shows three plots: 1) Component + Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. + If set to True, plots the component breakdown of the predicted values. When set to True, + it only plots one plot, the component plot, as there are no change points or residuals + in this time frame. + title: `str`, optional, default None + Title of the plot. Returns ------- fig: `plotly.graph_objects.Figure` - Figure. + Figure plotting components against appropriate time scale. Plot layout includes: + - Plot 1, "Component Plot" - breakdown from forecast_breakdown + - Plot 2, "Trend + Change Points" + - Plot 3, "Residuals + Smoothed Residuals"; smoothing done using exponentially weighted moving average """ - if title is None: - title = "Seasonality plot" - seas_names = [ - "DAILY_SEASONALITY", - "WEEKLY_SEASONALITY", - "MONTHLY_SEASONALITY", - "QUARTERLY_SEASONALITY", - "YEARLY_SEASONALITY"] - return self.plot_components(names=seas_names, title=title) + if self.model_dict is None: + raise NotFittedError("Call `fit` before calling `plot_components`.") + if self.forecast_x_mat is None and predict_phase is True: + raise ValueError("Call the predict method before calling `plot_components` to generate forecasts") + + if not hasattr(self.model_dict["ml_model"], "coef_"): + raise NotImplementedError("Component plot has only been implemented for additive linear models.") + + if type(center_components) is not bool: + raise TypeError("center_components must be bool: True/False") + + if denominator is not None: + if denominator not in ["abs_y_mean", "y_std"]: + raise ValueError("Choose denominator from: ['abs_y_mean', 'y_std']") + + # Defines regex dictionary to be the default component dictionary: + if grouping_regex_patterns_dict is None: + grouping_regex_patterns_dict = cst.DEFAULT_COMPONENTS_REGEX_DICT + if type(grouping_regex_patterns_dict) is not dict: + 
raise TypeError("grouping_regex_patterns_dict must be type dict") + if len(grouping_regex_patterns_dict) == 0: + raise ValueError("grouping_regex_patterns_dict must be non-empty") + + # Chooses `x_mat` and `time_values` to use + # Creates default plot title if not user supplied + if not predict_phase: + x_mat = self.model_dict["x_mat"].reset_index(drop=True) + time_values = pd.to_datetime(self.model_dict["df_dropna"][self.model_dict["time_col"]]) + if title is None: + title = "Component Plot - Training" + else: + train_end_date = self.model_dict["last_date_for_fit"] + x_mat = self.forecast_x_mat.reset_index(drop=True) + time_values = pd.to_datetime(self.forecast[self.time_col_]) + if title is None: + title = "Component Plot - Predicted" + + # Builds a forecast breakdown + breakdown = self.forecast_breakdown( + grouping_regex_patterns_dict=grouping_regex_patterns_dict, + forecast_x_mat=x_mat, + time_values=time_values, + center_components=center_components, + denominator=denominator, + plt_title=title + ) + + # Selects results for component dataframe + df = breakdown["breakdown_df"] + self.fit_components = df + + # Collects change points from estimator + changepoint_columns = [x.split(":")[0] for x in self.model_dict["x_mat"].columns if re.match("changepoint", x) is not None] + change_points = get_trend_changepoint_dates_from_cols(trend_cols=set(changepoint_columns)) + + # Calculates residuals and a smoothed estimate of residuals + y_true = self.model_dict["y_train"].values + y_pred = self.model_dict["y_train_pred"] + residuals = pd.Series(y_true - y_pred) + residuals_smoothed = residuals.ewm(int(len(residuals)/50)).mean() + + # Defines a color palette for the figure + line_colors = px.colors.qualitative.Bold + + # Creates the figure + if predict_phase: + num_rows = 1 + subplot_title_set = ["Component plot"] + else: + num_rows = 3 + subplot_title_set = ["Component plot", "Trend and change points", "Residuals"] + + fig = make_subplots( + rows=num_rows, + cols=1, + 
vertical_spacing=0.5 / num_rows, + subplot_titles=subplot_title_set, + shared_xaxes=True + ) + + # Panel 1 - Adds breakdown component traces to the figure + for i, column_name in enumerate(df.columns): + if column_name != "Trend": + fig.add_trace( + go.Scatter( + x=time_values, + y=df[column_name], + name=column_name, + line=go.scatter.Line( + color=line_colors[i]), + opacity=0.8), + row=1, + col=1) + else: + # If "Trend" line, adds legendgroup so that trend lines toggle on/off together + fig.add_trace( + go.Scatter( + x=time_values, + y=df[column_name], + name=column_name, + line=go.scatter.Line( + color=line_colors[i]), + opacity=0.8, + legendgroup="TrendGroup"), + row=1, + col=1) + + # Adds rangeslider under the first panel + fig.update_xaxes( + rangeslider_visible=True, + rangeslider_thickness=0.05, + title_text="Date", + row=1, + col=1 + ) + # By default, rangeslider turns off vertical zoom, this turns it back on + fig.update_yaxes(fixedrange=False) + + # Only plot second and third panels if not predict_phase + if predict_phase: + if train_end_date is not None and train_end_date in time_values.to_list(): + new_layout = dict( + # Adds vertical line + shapes=[dict( + type="line", + xref="x", + yref="paper", # y-reference is assigned to the plot paper [0,1] + x0=pd.to_datetime(train_end_date), + y0=0, + x1=pd.to_datetime(train_end_date), + y1=1, + line=dict( + color="rgba(100, 100, 100, 0.9)", + width=1.0) + )], + # Adds text annotation + annotations=[dict( + xref="x", + xanchor="right", + yanchor="middle", + x=pd.to_datetime(train_end_date), + yref="paper", + y=.97, + text="Train End Date", + showarrow=True, + arrowhead=0, + ax=-20, + axref="pixel", + ay=0 + )] + ) + fig.update_layout(new_layout) + + # Updates title based on user input, centers the title, adjusts spacing, and turns on tick labels for all plots + fig.update_layout( + title={ + "text": title, + "x": 0.5 + } + ) + else: + # Panel 2 - Adds trace for trend plot + fig.add_trace( + go.Scatter( + 
x=time_values, + y=df["Trend"], + name="Trend", + line=go.scatter.Line( + color=line_colors[df.columns.to_list().index("Trend")]), + opacity=0.8, + legendgroup="TrendGroup", + showlegend=False), + row=2, + col=1) + + # Panel 2 - Adds traces for changepoints. All change groups are one group in legend so they toggle on/off together. + if change_points is not None: + for cp_num, cp in enumerate(change_points): + if cp_num == 0: + in_legend = True + else: + in_legend = False + + fig.add_trace( + go.Scatter( + x=[pd.to_datetime(cp), pd.to_datetime(cp)], + y=[df["Trend"].min(), df["Trend"].max()], + name="Changepoints", + legendgroup="Changepoints", + mode="lines", + line=go.scatter.Line( + color="#000000", # black + width=2, + dash="dot"), + opacity=0.75, + showlegend=in_legend), + row=2, + col=1) + + # Panel 3 - Adds traces for the residuals and smoothed residuals + fig.add_trace( + go.Scatter( + x=pd.to_datetime(self.df[self.model_dict["time_col"]]), + y=residuals, + name="Residuals", + line=go.scatter.Line( + color="rgb(0,0,0)"), + opacity=0.75), + row=3, + col=1) + fig.add_trace( + go.Scatter( + x=pd.to_datetime(self.df[self.model_dict["time_col"]]), + y=residuals_smoothed, + name="Smoothed Residuals", + line=go.scatter.Line( + color="rgb(250,237,9)"), + opacity=0.75), + row=3, + col=1) + + # Updates title based on user input, centers the title, adjusts spacing, and turns on tick labels for all plots + fig.update_layout( + title={ + "text": title, + "x": 0.5 + }, + height=350 * num_rows, + xaxis_showticklabels=True, + xaxis2_showticklabels=True + ) + + return fig def plot_trend_changepoint_detection(self, params=None): """Convenience function to plot the original trend changepoint detection results. 
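Note on the ``plot_components`` change above: the new docstring says the method relies on a dictionary of regular expressions to group components together. The sketch below illustrates that general idea on a small, made-up, coefficient-weighted design matrix. It is not the library's ``forecast_breakdown`` implementation. The helper name ``group_columns_by_regex``, the column names, the regex patterns, and the "Other" bucket are all hypothetical and chosen only for illustration; the real patterns live in ``greykite.common.constants``.

```python
import re

import pandas as pd


def group_columns_by_regex(x_mat_weighted, grouping_regex_patterns_dict):
    """Sums coefficient-weighted design-matrix columns into named groups.

    Hypothetical helper for illustration only. Each column is assigned to the
    first group whose regex matches it. Columns that match no group are
    collected under "Other" (an assumed name, not taken from the library).
    """
    breakdown = pd.DataFrame(index=x_mat_weighted.index)
    remaining = list(x_mat_weighted.columns)
    for group_name, pattern in grouping_regex_patterns_dict.items():
        matched = [col for col in remaining if re.search(pattern, col)]
        remaining = [col for col in remaining if col not in matched]
        # Summing an empty selection yields a column of zeros.
        breakdown[group_name] = x_mat_weighted[matched].sum(axis=1)
    if remaining:
        breakdown["Other"] = x_mat_weighted[remaining].sum(axis=1)
    return breakdown


# Made-up weighted design matrix (coefficient * feature value per column).
x_mat_weighted = pd.DataFrame({
    "Intercept": [1.0, 1.0, 1.0],
    "ct1": [0.5, 1.0, 1.5],
    "changepoint0_2020_01_01_00": [0.0, 0.0, 0.25],
    "sin1_tow_weekly": [0.5, -0.5, 0.0],
    "cos1_tow_weekly": [0.25, 0.25, -0.75],
})

# Illustrative patterns only; not the library's default dictionary.
patterns = {
    "Intercept": r"^Intercept$",
    "Trend": r"^ct\d|changepoint",
    "Seasonality": r"^(sin|cos)\d+_",
}

breakdown = group_columns_by_regex(x_mat_weighted, patterns)
print(breakdown)
```

Because every column lands in exactly one group, the row sums of ``breakdown`` equal the row sums of the weighted design matrix, so the grouped components still add up to the fitted values. Assigning each column to the first matching pattern makes the dictionary order significant, which is a design choice of this sketch and may differ from how the library resolves overlapping patterns.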
diff --git a/greykite/sklearn/estimator/lag_based_estimator.py b/greykite/sklearn/estimator/lag_based_estimator.py index 255a43a..0d6abf1 100644 --- a/greykite/sklearn/estimator/lag_based_estimator.py +++ b/greykite/sklearn/estimator/lag_based_estimator.py @@ -38,6 +38,7 @@ from greykite.common.logging import LoggingLevelEnum from greykite.common.logging import log_message from greykite.common.time_properties import fill_missing_dates +from greykite.common.time_properties import infer_freq from greykite.sklearn.estimator.base_forecast_estimator import BaseForecastEstimator @@ -308,7 +309,7 @@ def _prepare_df(self): self.df[self.time_col_] = pd.to_datetime(self.df[self.time_col_]) self.df = self.df.sort_values(by=self.time_col_).reset_index(drop=True) # Infers data frequency. - freq = pd.infer_freq(self.df[self.time_col_]) + freq = infer_freq(self.df, self.time_col_) if self.freq is None: self.freq = freq if freq is not None and self.freq != freq: diff --git a/greykite/sklearn/estimator/multistage_forecast_estimator.py b/greykite/sklearn/estimator/multistage_forecast_estimator.py index 922d238..f544106 100644 --- a/greykite/sklearn/estimator/multistage_forecast_estimator.py +++ b/greykite/sklearn/estimator/multistage_forecast_estimator.py @@ -38,6 +38,7 @@ from greykite.common.aggregation_function_enum import AggregationFunctionEnum from greykite.common.logging import LoggingLevelEnum from greykite.common.logging import log_message +from greykite.common.time_properties import infer_freq from greykite.sklearn.estimator.base_forecast_estimator import BaseForecastEstimator from greykite.sklearn.estimator.simple_silverkite_estimator import SimpleSilverkiteEstimator @@ -259,7 +260,7 @@ def fit( value_col=value_col, **fit_params) if self.freq is None: - self.freq = pd.infer_freq(X[time_col]) + self.freq = infer_freq(X, time_col) if self.freq is None: raise ValueError("Failed to infer frequency from data, please provide during " "instantiation. 
Data frequency is required for aggregation.") diff --git a/greykite/sklearn/estimator/prophet_estimator.py b/greykite/sklearn/estimator/prophet_estimator.py index 53e4a8e..0a6a4ac 100644 --- a/greykite/sklearn/estimator/prophet_estimator.py +++ b/greykite/sklearn/estimator/prophet_estimator.py @@ -76,7 +76,7 @@ class ProphetEstimator(BaseForecastEstimator): } add_seasonality_dict: dict of custom seasonality parameters to be added to the model, optional, default=None - parameter details: https://github.com/facebook/prophet/blob/master/python/prophet/forecaster.py - refer to + parameter details: https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py - refer to add_seasonality() function. Key is the seasonality component name e.g. 'monthly'; parameters are specified via dict. @@ -115,7 +115,7 @@ class ProphetEstimator(BaseForecastEstimator): Prophet documentation for a description: * https://facebook.github.io/prophet/docs/quick_start.html - * https://github.com/facebook/prophet/blob/master/python/prophet/forecaster.py + * https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py Attributes ---------- @@ -378,6 +378,6 @@ def plot_components( if "'DatetimeIndex'" in repr(e): # 'DatetimeIndex' object has no attribute 'weekday_name' raise Exception("Prophet 0.5 component plots are incompatible with pandas 1.*. " - "Upgrade to prophet:0.6 or higher.") + "Upgrade to Prophet 0.6 or higher.") else: raise e diff --git a/greykite/sklearn/estimator/silverkite_diagnostics.py b/greykite/sklearn/estimator/silverkite_diagnostics.py deleted file mode 100644 index eb21db0..0000000 --- a/greykite/sklearn/estimator/silverkite_diagnostics.py +++ /dev/null @@ -1,432 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. 
-# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# original author: Sayan Patra -"""Silverkite plotting functions.""" -import warnings -from typing import Type - -import numpy as np -import pandas as pd -from plotly import graph_objects as go -from plotly.subplots import make_subplots - -from greykite.algo.changepoint.adalasso.changepoints_utils import get_trend_changepoint_dates_from_cols -from greykite.algo.common.model_summary import ModelSummary -from greykite.algo.forecast.silverkite.constants.silverkite_component import SilverkiteComponentsEnum -from greykite.algo.forecast.silverkite.constants.silverkite_component import SilverkiteComponentsEnumMixin -from greykite.algo.forecast.silverkite.constants.silverkite_constant import default_silverkite_constant -from greykite.common import constants as cst -from greykite.common.python_utils import get_pattern_cols -from greykite.common.viz.timeseries_plotting import add_groupby_column -from greykite.common.viz.timeseries_plotting import grouping_evaluation - - -class SilverkiteDiagnostics: - """Provides various plotting functions for the model generated by the Silverkite forecast algorithms. - - Attributes - ---------- - _silverkite_components_enum : Type[SilverkiteComponentsEnum] - The constants for plotting the silverkite components. - model_dict : `dict` or None - A dict with fitted model and its attributes. 
- The output of `~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast.forecast`. - pred_category : `dict` or None - A dictionary with keys being the predictor category and - values being the predictors belonging to the category. - For details, see - `~greykite.sklearn.estimator.base_silverkite_estimator.BaseSilverkiteEstimator.pred_category`. - time_col : str - Name of input data time column - value_col : str - Name of input data value column - components : `dict` or None - Components of the ``SilverkiteEstimator`` model. Set by ``self.plot_components``. - For details about the possible key values see - `~greykite.sklearn.estimator.silverkite_diagnostics.SilverkiteDiagnostics.get_silverkite_components`. - Not available for ``random forest`` and ``gradient boosting`` methods and - set to the default value `None`. - model_summary : `class` or `None` - The `~greykite.algo.common.model_summary.ModelSummary` class. - """ - def __init__( - self, - constants: SilverkiteComponentsEnumMixin = default_silverkite_constant): - self._silverkite_components_enum: Type[SilverkiteComponentsEnum] = constants.get_silverkite_components_enum() - self.pred_category = None - self.time_col = None - self.value_col = None - self.components = None - self.model_summary = None - - def set_params(self, pred_category, time_col, value_col): - """ - Set the various params after the model has been created. - - Parameters - ---------- - pred_category : `dict` or None - A dictionary with keys being the predictor category and - values being the predictors belonging to the category. - For details, see `~greykite.sklearn.estimator.base_silverkite_estimator.BaseSilverkiteEstimator.pred_category`. - time_col: `str` - Time column name in the data frame. - value_col: `str` - Value column name in the data frame. 
- """ - self.pred_category = pred_category - self.time_col = time_col - self.value_col = value_col - - def summary(self, model_dict, max_colwidth=20) -> ModelSummary: - """Creates the model summary for the given model - - Parameters - ---------- - model_dict : `dict` or None - A dict with fitted model and its attributes. - max_colwidth : `int` - The maximum length for predictors to be shown in their original name. - If the maximum length of predictors exceeds this parameter, all - predictors name will be suppressed and only indices are shown. - - Returns - ------- - model_summary: `ModelSummary` - The model summary for this model. See `~greykite.algo.common.model_summary.ModelSummary` - """ - - if model_dict is not None: - # tree models do not have beta - self.model_summary = ModelSummary( - x=model_dict["x_mat"].values, - y=model_dict["y"].values, - pred_cols=list(model_dict["x_mat"].columns), - pred_category=self.pred_category, - fit_algorithm=model_dict["fit_algorithm"], - ml_model=model_dict["ml_model"], - max_colwidth=max_colwidth) - else: - self.model_summary = None - return self.model_summary - - def plot_components(self, model_dict, names=None, title=None): - """Class method to plot the components of a ``Silverkite`` model on the dataset passed to ``fit``. - - Parameters - ---------- - model_dict : `dict` or None - A dict with fitted model and its attributes. - names: `list` [`str`], default `None` - Names of the components to be plotted e.g. names = ["trend", "DAILY_SEASONALITY"]. - See `~greykite.sklearn.estimator.silverkite_diagnostics.get_silverkite_components` - for the full list of valid names. - If `None`, all the available components are plotted. - title: `str`, optional, default `None` - Title of the plot. If `None`, default title is "Component plot". - - Returns - ------- - fig: `plotly.graph_objects.Figure` - Figure plotting components against appropriate time scale. 
- """ - if model_dict is None: - raise NotImplementedError("Call `self.set_params` before calling `plot_components`.") - - # recomputes `self.components` every time in case model was refit - if not hasattr(model_dict["ml_model"], "coef_"): - raise NotImplementedError("Component plot has only been implemented for additive linear models.") - else: - # Computes components for the training observations used to fit the model. - # Observations with NAs that are dropped when fitting are not included. - x_mat = model_dict["x_mat"] - ml_model_coef = model_dict["ml_model"].coef_ - ml_model_intercept = model_dict["ml_model"].intercept_ - data_len = len(x_mat) - ml_cols = list(x_mat.columns) - - x_mat_weighted = ml_model_coef * x_mat - if ml_model_intercept: - if "Intercept" in ml_cols: - x_mat_weighted["Intercept"] += np.repeat(ml_model_intercept, data_len) - else: - x_mat_weighted["Intercept"] = np.repeat(ml_model_intercept, data_len) - - self.components = self.get_silverkite_components( - df=model_dict["df_dropna"], - time_col=self.time_col, - value_col=self.value_col, - feature_df=x_mat_weighted) - - return self.plot_silverkite_components( - components=self.components, - names=names, - title=title) - - def get_silverkite_components( - self, - df, - time_col, - value_col, - feature_df): - """Compute the components of a ``Silverkite`` model. - - Notes - ----- - This function signature is chosen this way so that an user using `forecast_silverkite` can also use - this function, without any changes to the `forecast_silverkite` function. User can compute `feature_df` - as follows. Here `model_dict` is the output of `forecast_silverkite`. - feature_df = model_dict["mod"].coef_ * model_dict["design_mat"] - - The function aggregates components based on the column names of `feature_df`. - `feature_df` is defined as the patsy design matrix built by `design_mat_from_formula` - multiplied by the corresponding coefficients, estimated by the silverkite model. 
- - - ``cst.TREND_REGEX``: Used to identify `feature_df` columns corresponding to trend. - See `greykite.common.features.timeseries_features.get_changepoint_features` for details - about changepoint column names. - - ``cst.SEASONALITY_REGEX``: Used to identify `feature_df` columns corresponding to seasonality. - This means to get correct seasonalities, the user needs to provide seas_names. - See `greykite.common.features.timeseries_features.get_fourier_col_name` for details - about seasonality column names. - - ``cst.EVENT_REGEX``: Used to identify `feature_df` columns corresponding to events such as holidays. - See `~greykite.common.features.timeseries_features.add_daily_events` for details - about event column names. - - Parameters - ---------- - df : `pandas.DataFrame` - A dataframe containing `time_col`, `value_col` and `regressors`. - time_col : `str` - The name of the time column in ``df``. - value_col : `str` - The name of the value column in ``df``. - feature_df : `pandas.DataFrame` - A dataframe containing feature columns and values. - - Returns - ------- - components : `pandas.DataFrame` - Contains the components of the model. Same number of rows as `df`. Possible columns are - - - `"time_col"`: same as input ``time_col``. - - `"value_col"`: same as input ``value_col``. - - `"trend"`: column containing the trend. - - `"autoregression"`: column containing the autoregression. - - `"lagged_regressor"`: column containing the lagged regressors. - - `"DAILY_SEASONALITY"`: column containing daily seasonality. - - `"WEEKLY_SEASONALITY"`: column containing weekly seasonality. - - `"MONTHLY_SEASONALITY"`: column containing monthly seasonality. - - `"QUARTERLY_SEASONALITY"`: column containing quarterly seasonality. - - `"YEARLY_SEASONALITY"`: column containing yearly seasonality. - - `"events"`: column containing events e.g. holidays effect. - - `"residual"`: column containing residuals. 
- - """ - if feature_df is None or feature_df.empty: - raise ValueError("feature_df must be non-empty") - - if df.shape[0] != feature_df.shape[0]: - raise ValueError("df and feature_df must have same number of rows.") - - feature_cols = feature_df.columns - components = df[[time_col, value_col]] - - # gets trend (this includes interaction terms) - trend_cols = get_pattern_cols(feature_cols, cst.TREND_REGEX, f"{cst.SEASONALITY_REGEX}|{cst.LAG_REGEX}") - if trend_cols: - components["trend"] = feature_df[trend_cols].sum(axis=1) - - # gets lagged terms (auto regression, lagged regressors and corresponding interaction terms) - lag_cols = get_pattern_cols(feature_cols, cst.LAG_REGEX) - if lag_cols: - ar_cols = [lag_col for lag_col in lag_cols if value_col in lag_col] - if ar_cols: - components["autoregression"] = feature_df[ar_cols].sum(axis=1) - lagged_regressor_cols = [lag_col for lag_col in lag_cols if value_col not in lag_col] - if lagged_regressor_cols: - components["lagged_regressor"] = feature_df[lagged_regressor_cols].sum(axis=1) - - # gets seasonalities - seas_cols = get_pattern_cols(feature_cols, cst.SEASONALITY_REGEX) - seas_components_dict = self._silverkite_components_enum.__dict__["_member_names_"].copy() - for seas in seas_components_dict: - seas_pattern = self._silverkite_components_enum[seas].value.ylabel - seas_pattern_cols = get_pattern_cols(seas_cols, seas_pattern) - if seas_pattern_cols: - components[seas] = feature_df[seas_pattern_cols].sum(axis=1) - - # gets events (holidays for now) - event_cols = get_pattern_cols(feature_cols, cst.EVENT_REGEX) - if event_cols: - components["events"] = feature_df[event_cols].sum(axis=1) - - # calculates residuals - components["residual"] = df[value_col].values - feature_df.sum(axis=1).values - - # gets trend changepoints - # keeps this column as the last column of the df - if trend_cols: - changepoint_dates = get_trend_changepoint_dates_from_cols(trend_cols=trend_cols) - if changepoint_dates: - ts = 
pd.to_datetime(components[time_col]) - changepoints = [1 if t in changepoint_dates else 0 for t in ts] - components["trend_changepoints"] = changepoints - - return components - - def group_silverkite_seas_components(self, df): - """Groups and renames``Silverkite`` seasonalities. - - Parameters - ---------- - df: `pandas.DataFrame` - DataFrame containing two columns: - - - ``time_col``: Timestamps of the original timeseries. - - ``seas``: A seasonality component. It must match a component name from the - `~greykite.algo.forecast.silverkite.constants.silverkite_component.SilverkiteComponentsEnum`. - - Returns - ------- - `pandas.DataFrame` - DataFrame grouped by the time feature corresponding to the seasonality - and renamed as defined in - `~greykite.algo.forecast.silverkite.constants.silverkite_component.SilverkiteComponentsEnum`. - """ - time_col, seas = df.columns - groupby_time_feature = self._silverkite_components_enum[seas].value.groupby_time_feature - xlabel = self._silverkite_components_enum[seas].value.xlabel - ylabel = self._silverkite_components_enum[seas].value.ylabel - - def grouping_func(grp): - return np.nanmean(grp[seas]) - - result = add_groupby_column( - df=df, - time_col=time_col, - groupby_time_feature=groupby_time_feature) - grouped_df = grouping_evaluation( - df=result["df"], - groupby_col=result["groupby_col"], - grouping_func=grouping_func, - grouping_func_name=ylabel) - grouped_df.rename({result["groupby_col"]: xlabel}, axis=1, inplace=True) - return grouped_df - - def plot_silverkite_components( - self, - components, - names=None, - title=None): - """Plot the components of a ``Silverkite`` model. - - Parameters - ---------- - components : `pandas.DataFrame` - A dataframe containing the components of a silverkite model, similar to the output - of `~greykite.sklearn.estimator.silverkite_diagnostics.get_silverkite_components`. - names: `list` [`str`], optional, default `None` - Names of the components to be plotted e.g. 
names = ["trend", "DAILY_SEASONALITY"]. - See `~greykite.sklearn.estimator.silverkite_diagnostics.get_silverkite_components` - for the full list of valid names. - If `None`, all the available components are plotted. - title: `str`, optional, default `None` - Title of the plot. If `None`, default title is "Component plot". - - Returns - ------- - fig: `plotly.graph_objects.Figure` - Figure plotting components against appropriate time scale. - - Notes - ----- - If names in `None`, all the available components are plotted. - ``value_col`` is always plotted in the first panel, as long as there is a match between - given ``names`` list and ``components.columns``. - - See Also - -------- - `~greykite.sklearn.estimator.silverkite_diagnostics.get_silverkite_components` - """ - - time_col, value_col = components.columns[:2] - if "trend_changepoints" in components.columns: - trend_changepoints = components[time_col].loc[components["trend_changepoints"] == 1].tolist() - components = components.drop("trend_changepoints", axis=1) - else: - trend_changepoints = None - if names is None: - names_kept = list(components.columns)[1:] # do not include time_col - else: - # loops over components.columns to maintain the order of the components - names_kept = [component for component in list(components.columns) if component in names] - names_removed = set(names) - set(components.columns) - - if not names_kept: - raise ValueError("None of the provided components have been specified in the model.") - elif names_removed: - warnings.warn(f"The following components have not been specified in the model: " - f"{names_removed}, plotting the rest.") - if names_kept[0] != value_col: - names_kept.insert(0, value_col) - - num_rows = len(names_kept) - fig = make_subplots(rows=num_rows, cols=1, vertical_spacing=0.35 / num_rows) - if title is None: - title = "Component plots" - fig.update_layout(dict(showlegend=True, title=title, title_x=0.5, height=350 * num_rows)) - - for ind, name in 
enumerate(names_kept): - df = components[[time_col, name]] - if "SEASONALITY" in name: - df = self.group_silverkite_seas_components(df) - - xlabel, ylabel = df.columns - row = ind + 1 - fig.append_trace(go.Scatter( - x=df[xlabel], - y=df[ylabel], - name=name, - mode="lines", - opacity=0.8, - showlegend=False - ), row=row, col=1) - - # `showline = True` shows a line only along the axes. i.e. for xaxis it will line the bottom - # of the image, but not top. Adding `mirror = True` also adds the line to the top. - fig.update_xaxes(title_text=xlabel, showline=True, mirror=True, row=row, col=1) - fig.update_yaxes(title_text=ylabel, showline=True, mirror=True, row=row, col=1) - - # plot trend change points - if trend_changepoints is not None and "trend" in names_kept: - for i, cp in enumerate(trend_changepoints): - show_legend = (i == 0) - fig.append_trace( - go.Scatter( - name="trend change point", - mode="lines", - x=[cp, cp], - y=[components["trend"].min(), components["trend"].max()], - line=go.scatter.Line( - color="#F44336", # red 500 - width=1.5, - dash="dash"), - showlegend=show_legend), - row=names_kept.index("trend") + 1, - col=1) - - return fig diff --git a/greykite/sklearn/estimator/silverkite_estimator.py b/greykite/sklearn/estimator/silverkite_estimator.py index 3aea70c..06f8814 100644 --- a/greykite/sklearn/estimator/silverkite_estimator.py +++ b/greykite/sklearn/estimator/silverkite_estimator.py @@ -28,7 +28,6 @@ from greykite.common import constants as cst from greykite.common.python_utils import update_dictionary from greykite.sklearn.estimator.base_silverkite_estimator import BaseSilverkiteEstimator -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics class SilverkiteEstimator(BaseSilverkiteEstimator): @@ -101,7 +100,6 @@ class SilverkiteEstimator(BaseSilverkiteEstimator): def __init__( self, silverkite: SilverkiteForecast = SilverkiteForecast(), - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics(), 
score_func=mean_squared_error, coverage=None, null_model_params=None, @@ -114,6 +112,8 @@ def __init__( training_fraction=None, fit_algorithm_dict=None, daily_event_df_dict=None, + daily_event_neighbor_impact=None, + daily_event_shifted_effect=None, fs_components_df=pd.DataFrame({ "name": [ cst.TimeFeaturesEnum.tod.value, @@ -138,11 +138,11 @@ def __init__( forecast_horizon=None, simulation_based=False, simulation_num=10, - fast_simulation=False): + fast_simulation=False, + remove_intercept=False): # every subclass of BaseSilverkiteEstimator must call super().__init__ super().__init__( silverkite=silverkite, - silverkite_diagnostics=silverkite_diagnostics, score_func=score_func, coverage=coverage, null_model_params=null_model_params, @@ -162,6 +162,8 @@ def __init__( self.fit_algorithm_dict = fit_algorithm_dict self.training_fraction = training_fraction self.daily_event_df_dict = daily_event_df_dict + self.daily_event_neighbor_impact = daily_event_neighbor_impact + self.daily_event_shifted_effect = daily_event_shifted_effect self.fs_components_df = fs_components_df self.autoreg_dict = autoreg_dict self.past_df = past_df @@ -180,6 +182,7 @@ def __init__( self.simulation_based = simulation_based self.simulation_num = simulation_num self.fast_simulation = fast_simulation + self.remove_intercept = remove_intercept self.validate_inputs() def validate_inputs(self): @@ -247,6 +250,8 @@ def fit( fit_algorithm=self.fit_algorithm_dict["fit_algorithm"], fit_algorithm_params=self.fit_algorithm_dict["fit_algorithm_params"], daily_event_df_dict=self.daily_event_df_dict, + daily_event_neighbor_impact=self.daily_event_neighbor_impact, + daily_event_shifted_effect=self.daily_event_shifted_effect, fs_components_df=self.fs_components_df, autoreg_dict=self.autoreg_dict, past_df=self.past_df, @@ -264,7 +269,8 @@ def fit( forecast_horizon=self.forecast_horizon, simulation_based=self.simulation_based, simulation_num=self.simulation_num, - fast_simulation=self.fast_simulation) + 
fast_simulation=self.fast_simulation, + remove_intercept=self.remove_intercept) # sets attributes based on ``self.model_dict`` super().finish_fit() diff --git a/greykite/sklearn/estimator/simple_silverkite_estimator.py b/greykite/sklearn/estimator/simple_silverkite_estimator.py index 405d2e3..de6093a 100644 --- a/greykite/sklearn/estimator/simple_silverkite_estimator.py +++ b/greykite/sklearn/estimator/simple_silverkite_estimator.py @@ -34,7 +34,6 @@ from greykite.common import constants as cst from greykite.common.python_utils import update_dictionary from greykite.sklearn.estimator.base_silverkite_estimator import BaseSilverkiteEstimator -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics from greykite.sklearn.uncertainty.uncertainty_methods import UncertaintyMethodEnum @@ -101,7 +100,6 @@ class SimpleSilverkiteEstimator(BaseSilverkiteEstimator): def __init__( self, silverkite: SimpleSilverkiteForecast = SimpleSilverkiteForecast(), - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics(), score_func: callable = mean_squared_error, coverage: float = None, null_model_params: Optional[Dict] = None, @@ -119,6 +117,8 @@ def __init__( holiday_post_num_days: int = 2, holiday_pre_post_num_dict: Optional[Dict] = None, daily_event_df_dict: Optional[Dict] = None, + daily_event_neighbor_impact: Optional[Union[int, List[int], callable]] = None, + daily_event_shifted_effect: Optional[List[str]] = None, auto_growth: bool = False, changepoints_dict: Optional[Dict] = None, auto_seasonality: bool = False, @@ -146,11 +146,11 @@ def __init__( regression_weight_col: Optional[str] = None, simulation_based: Optional[bool] = False, simulation_num: int = 10, - fast_simulation: bool = False): + fast_simulation: bool = False, + remove_intercept: bool = False): # every subclass of BaseSilverkiteEstimator must call super().__init__ super().__init__( silverkite=silverkite, - silverkite_diagnostics=silverkite_diagnostics, score_func=score_func, 
coverage=coverage, null_model_params=null_model_params, @@ -175,6 +175,8 @@ def __init__( self.holiday_post_num_days = holiday_post_num_days self.holiday_pre_post_num_dict = holiday_pre_post_num_dict self.daily_event_df_dict = daily_event_df_dict + self.daily_event_neighbor_impact = daily_event_neighbor_impact + self.daily_event_shifted_effect = daily_event_shifted_effect self.auto_growth = auto_growth self.changepoints_dict = changepoints_dict self.auto_seasonality = auto_seasonality @@ -203,6 +205,7 @@ def __init__( self.simulation_based = simulation_based self.simulation_num = simulation_num self.fast_simulation = fast_simulation + self.remove_intercept = remove_intercept # ``forecast_simple_silverkite`` generates a ``fs_components_df`` to call # ``forecast_silverkite`` that is compatible with ``BaseSilverkiteEstimator``. # Unlike ``SilverkiteEstimator``, this does not need to call ``validate_inputs``. @@ -287,6 +290,8 @@ def fit( holiday_post_num_days=self.holiday_post_num_days, holiday_pre_post_num_dict=self.holiday_pre_post_num_dict, daily_event_df_dict=self.daily_event_df_dict, + daily_event_neighbor_impact=self.daily_event_neighbor_impact, + daily_event_shifted_effect=self.daily_event_shifted_effect, auto_growth=self.auto_growth, changepoints_dict=self.changepoints_dict, auto_seasonality=self.auto_seasonality, @@ -314,7 +319,8 @@ def fit( regression_weight_col=self.regression_weight_col, simulation_based=self.simulation_based, simulation_num=self.simulation_num, - fast_simulation=self.fast_simulation) + fast_simulation=self.fast_simulation, + remove_intercept=self.remove_intercept) # Fits the uncertainty model if not already fit. 
if self.uncertainty_dict is not None and uncertainty_dict is None: diff --git a/greykite/sklearn/estimator/testing_utils.py b/greykite/sklearn/estimator/testing_utils.py index 98fd356..b435e6d 100644 --- a/greykite/sklearn/estimator/testing_utils.py +++ b/greykite/sklearn/estimator/testing_utils.py @@ -112,3 +112,20 @@ def params_components(): "max_admissible_value": None, "uncertainty_dict": uncertainty_dict } + + +def params_component_breakdowns(): + """Parameters for ``plot_components``""" + expected_component_names = [ + "Intercept", + "Regressors", + "Autoregressive", + "Event", + "Trend", + "Seasonality", + "Residuals", + "Smoothed Residuals", + "Changepoints"] + return { + "expected_component_names": expected_component_names, + } diff --git a/greykite/tests/algo/changepoint/adalasso/test_changepoint_detector.py b/greykite/tests/algo/changepoint/adalasso/test_changepoint_detector.py index 4b226b5..9303879 100644 --- a/greykite/tests/algo/changepoint/adalasso/test_changepoint_detector.py +++ b/greykite/tests/algo/changepoint/adalasso/test_changepoint_detector.py @@ -239,7 +239,9 @@ def test_find_trend_changepoints(hourly_data): regularization_strength=-1 ) # estimator parameter combination not valid warning - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="trend_estimator not in"): model = ChangepointDetector() model.find_trend_changepoints( df=df, @@ -247,9 +249,9 @@ def test_find_trend_changepoints(hourly_data): value_col="y", trend_estimator="something" ) - assert "trend_estimator not in ['ridge', 'lasso', 'ols'], " \ - "estimating using ridge" in record[0].message.args[0] - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="trend_estimator"): model = ChangepointDetector() model.find_trend_changepoints( df=df, @@ -258,9 +260,9 @@ def test_find_trend_changepoints(hourly_data): trend_estimator="ols", yearly_seasonality_order=8 ) - assert "trend_estimator = 'ols' with 
year_seasonality_order > 0 may create " \ - "over-fitting, trend_estimator has been set to 'ridge'." in record[0].message.args[0] - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="adaptive_lasso_initial_estimator"): model = ChangepointDetector() model.find_trend_changepoints( df=df, @@ -268,8 +270,6 @@ def test_find_trend_changepoints(hourly_data): value_col="y", adaptive_lasso_initial_estimator="something" ) - assert "adaptive_lasso_initial_estimator not in ['ridge', 'lasso', 'ols'], " \ - "estimating with ridge" in record[0].message.args[0] # df sample size too small df = pd.DataFrame( data={ @@ -415,16 +415,14 @@ def test_find_seasonality_changepoints(hourly_data): time_col="ts", value_col="y" ) - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="Trend changepoints are already identified, using past trend estimation."): cd.find_seasonality_changepoints( df=df2, time_col="ts2", value_col="y2" ) - assert ("Trend changepoints are already identified, using past trend estimation. " - "If you would like to run trend change point detection again, " - "please call ``find_trend_changepoints`` with desired parameters " - "before calling ``find_seasonality_changepoints``.") in record[0].message.args[0] assert cd.time_col == "ts" assert cd.value_col == "y" # negative potential_changepoint_n @@ -446,16 +444,15 @@ def test_find_seasonality_changepoints(hourly_data): regularization_strength=-1 ) # test regularization_strength == None warning - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="regularization_strength is set to None. This will"): model.find_seasonality_changepoints( df=df, time_col="ts", value_col="y", regularization_strength=None ) - assert ("regularization_strength is set to None. This will trigger cross-validation to " - "select the tuning parameter which might result in too many change points. 
" - "Keep the default value or tuning around it is recommended.") in record[0].message.args[0] # test existing trend estimation warning model = ChangepointDetector() model.find_trend_changepoints( @@ -463,16 +460,14 @@ def test_find_seasonality_changepoints(hourly_data): time_col="ts", value_col="y" ) - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="Trend changepoints are already identified, using past trend estimation."): model.find_seasonality_changepoints( df=df, time_col="ts", value_col="y" ) - assert ("Trend changepoints are already identified, using past trend estimation. " - "If you would like to run trend change point detection again, " - "please call ``find_trend_changepoints`` with desired parameters " - "before calling ``find_seasonality_changepoints``.") in record[0].message.args[0] # df sample size too small df_small = pd.DataFrame( data={ @@ -513,7 +508,9 @@ def test_plot(hourly_data): value_col="y" ) # test empty plot - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="Figure is empty, at least one component"): model.plot( observation=False, observation_original=False, @@ -522,9 +519,10 @@ def test_plot(hourly_data): yearly_seasonality_estimate=False, adaptive_lasso_estimate=False ) - assert "Figure is empty, at least one component has to be true." in record[0].message.args[0] # test plotting change without estimation - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="You haven't run trend change point detection "): model = ChangepointDetector() model.plot( observation=False, @@ -534,10 +532,10 @@ def test_plot(hourly_data): yearly_seasonality_estimate=False, adaptive_lasso_estimate=False ) - assert "You haven't run trend change point detection algorithm yet. " \ - "Please call find_trend_changepoints first." 
in record[0].message.args[0] # test plotting seasonality change or estimation without estimation - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="You haven't run seasonality change point detection"): model = ChangepointDetector() model.plot( observation=False, @@ -548,9 +546,9 @@ def test_plot(hourly_data): adaptive_lasso_estimate=False, seasonality_change=True ) - assert ("You haven't run seasonality change point detection algorithm yet. " - "Please call find_seasonality_changepoints first.") in record[0].message.args[0] - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="You haven't run seasonality change point detection"): model = ChangepointDetector() model.plot( observation=False, @@ -561,8 +559,6 @@ def test_plot(hourly_data): adaptive_lasso_estimate=False, seasonality_estimate=True ) - assert ("You haven't run seasonality change point detection algorithm yet. " - "Please call find_seasonality_changepoints first.") in record[0].message.args[0] def test_get_changepoints_dict(): @@ -666,15 +662,15 @@ def test_get_changepoints_dict(): "method": "auto", "unused_key": "value" } - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="The following keys in"): get_changepoints_dict( df=df_pt, time_col="ts", value_col="y", changepoints_dict=changepoints_dict ) - assert (f"The following keys in ``changepoints_dict`` are not recognized\n" - f"{['unused_key']}") in record[0].message.args[0] def test_get_seasonality_changepoints(): diff --git a/greykite/tests/algo/common/test_holiday_grouper.py b/greykite/tests/algo/common/test_holiday_grouper.py new file mode 100644 index 0000000..5b2743c --- /dev/null +++ b/greykite/tests/algo/common/test_holiday_grouper.py @@ -0,0 +1,306 @@ +import pandas as pd +import pytest + +from greykite.algo.common.holiday_grouper import HolidayGrouper +from greykite.algo.common.holiday_utils import HOLIDAY_NAME_COL +from 
greykite.algo.common.holiday_utils import get_weekday_weekend_suffix +from greykite.common.constants import EVENT_DF_DATE_COL +from greykite.common.constants import EVENT_DF_LABEL_COL +from greykite.common.constants import TIME_COL +from greykite.common.constants import VALUE_COL +from greykite.common.data_loader import DataLoader +from greykite.common.features.timeseries_features import get_holidays +from greykite.common.python_utils import assert_equal + + +@pytest.fixture +def daily_df(): + df = DataLoader().load_peyton_manning() + df[TIME_COL] = pd.to_datetime(df[TIME_COL]) + return df + + +@pytest.fixture +def holiday_df(): + holiday_df = get_holidays(countries=["US"], year_start=2010, year_end=2016)["US"] + # Only keeps non-observed holidays. + holiday_df = holiday_df[~holiday_df["event_name"].str.contains("Observed")] + return holiday_df.sort_values(by=[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]).reset_index(drop=True) + + +HOLIDAY_IMPACT_DICT = { + "Christmas Day": (4, 3), # Always 12/25. + "Halloween": (1, 1), # Always 10/31. + "Independence Day": (4, 4), # Always 7/4. + "Labor Day": (3, 1), # Monday. + "Martin Luther King Jr. Day": (3, 1), # Monday. + "Memorial Day": (3, 1), # Monday. + "New Year's Day": (3, 4), # Always 1/1. + "Thanksgiving": (1, 4), # Thursday. +} + + +def test_expand_holiday_df_with_suffix(holiday_df): + """Tests `expand_holiday_df_with_suffix` function.""" + # Tests the case when no change is made. + expanded_holiday_df = HolidayGrouper.expand_holiday_df_with_suffix( + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=None, + get_suffix_func=None + ).sort_values(by=[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]).reset_index(drop=True) + + assert_equal(expanded_holiday_df, holiday_df) + + # When unknown holidays are present in `holiday_impact_dict`, result remains the same. 
+ expanded_holiday_df = HolidayGrouper.expand_holiday_df_with_suffix( + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict={"unknown": [1, 1]}, + get_suffix_func=None + ).sort_values(by=[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]).reset_index(drop=True) + + assert_equal(expanded_holiday_df, holiday_df) + + # Tests the case when only neighboring days are added. + expanded_holiday_df = HolidayGrouper.expand_holiday_df_with_suffix( + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=HOLIDAY_IMPACT_DICT, + get_suffix_func=None + ).sort_values(by=[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]).reset_index(drop=True) + + # Spot checks a few events are being correctly added. + assert "Christmas Day_minus_4" in expanded_holiday_df[EVENT_DF_LABEL_COL].tolist() + assert "New Year's Day_plus_4" in expanded_holiday_df[EVENT_DF_LABEL_COL].tolist() + + # Checks the expected total number of events. + expected_diff = 0 + for event, (pre, post) in HOLIDAY_IMPACT_DICT.items(): + count = (holiday_df[EVENT_DF_LABEL_COL] == event).sum() + additional_days = (pre + post) * count + expected_diff += additional_days + assert len(expanded_holiday_df) - len(holiday_df) == expected_diff + + # Tests the case when both neighboring days and suffixes are added. + expanded_holiday_df = HolidayGrouper.expand_holiday_df_with_suffix( + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=HOLIDAY_IMPACT_DICT, + get_suffix_func=get_weekday_weekend_suffix + ).sort_values(by=[EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL]).reset_index(drop=True) + + # Checks an instance where New Year's Day falls on Friday. 
+ idx = expanded_holiday_df["event_name"].str.contains("New Year's Day_WD_plus_1_WE") + assert expanded_holiday_df.loc[idx, EVENT_DF_DATE_COL].values[0] == pd.to_datetime("2010-01-02") + assert expanded_holiday_df.loc[idx, EVENT_DF_DATE_COL].values[1] == pd.to_datetime("2016-01-02") + + # Checks an instance where Christmas Day falls on Sunday and is observed on Monday. + idx = expanded_holiday_df["event_name"].str.contains("Christmas Day_WE_plus_1_WD") + assert expanded_holiday_df.loc[idx, EVENT_DF_DATE_COL].values[0] == pd.to_datetime("2011-12-26") + assert expanded_holiday_df.loc[idx, EVENT_DF_DATE_COL].values[1] == pd.to_datetime("2016-12-26") + + # Checks an instance where Labor Day always falls on Monday for all years. + idx = expanded_holiday_df["event_name"].str.contains("Labor Day_WD_minus_1_WE") + assert idx.sum() == 7 + + # Tests unknown `get_suffix_func`. + with pytest.raises(NotImplementedError, match="is not supported"): + HolidayGrouper.expand_holiday_df_with_suffix( + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=None, + get_suffix_func="unknown" + ) + + +def test_holiday_grouper_init(daily_df, holiday_df): + """Tests the initialization of `HolidayGrouper`.""" + # Tests pre-processing. + hg = HolidayGrouper( + df=daily_df, + time_col=TIME_COL, + value_col=VALUE_COL, + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=None, + get_suffix_func=None + ) + # After initialization, a new column (if it does not already exist) + # will be added to `holiday_df` and `expanded_holiday_df`. 
+ assert HOLIDAY_NAME_COL in hg.holiday_df.columns + assert HOLIDAY_NAME_COL in hg.expanded_holiday_df.columns + + +def test_group_holidays(daily_df, holiday_df): + """Tests `get_holiday_scores` and `group_holidays` functions.""" + default_get_suffix_func = "wd_we" + default_baseline_offsets = (-7, 7) + default_use_relative_score = True + + # Initializes the holiday grouper. + hg = HolidayGrouper( + df=daily_df, + time_col=TIME_COL, + value_col=VALUE_COL, + holiday_df=holiday_df, + holiday_date_col=EVENT_DF_DATE_COL, + holiday_name_col=EVENT_DF_LABEL_COL, + holiday_impact_dict=HOLIDAY_IMPACT_DICT, + get_suffix_func=default_get_suffix_func + ) + assert hg.get_suffix_func == default_get_suffix_func + assert hg.baseline_offsets is None + assert hg.use_relative_score is None + + # Runs the holiday grouper with KDE-based clustering. + min_n_days = 2 + min_same_sign_ratio = 0.66 + min_abs_avg_score = 0.02 + bandwidth_multiplier = 0.2 + + hg.group_holidays( + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score, + clustering_method="kde", + bandwidth_multiplier=bandwidth_multiplier + ) + + # Checks the attributes are overridden. + assert hg.baseline_offsets == default_baseline_offsets + assert hg.use_relative_score == default_use_relative_score + assert hg.clustering_method == "kde" + assert hg.result_dict is not None + assert hg.bandwidth is not None + assert hg.bandwidth_multiplier == bandwidth_multiplier + assert hg.kde is not None + + # Checks correctness of the grouping results. 
+ result_dict = hg.result_dict.copy() + + expected_keys = [ + "holiday_inferrer", + "score_result_original", + "score_result_avg_original", + "score_result", + "score_result_avg", + "daily_event_df_dict_with_score", + "daily_event_df_dict", + "kde_cutoffs", + "kde_res", + "kde_plot" + ] + for key in expected_keys: + assert result_dict[key] is not None + + assert len(result_dict["score_result_original"]) == len(result_dict["score_result_avg_original"]) + assert len(result_dict["score_result"]) == len(result_dict["score_result_avg"]) + + # Checks if the pruning works as expected. + for key, value in result_dict["score_result_avg"].items(): + assert abs(result_dict["score_result_avg"][key]) >= min_abs_avg_score + assert len(result_dict["score_result"][key]) >= min_n_days + + # Checks the grouped holidays output. + for event_df in result_dict["daily_event_df_dict"].values(): + assert event_df.shape[1] == 2 + assert EVENT_DF_DATE_COL in event_df.columns + assert EVENT_DF_LABEL_COL in event_df.columns + + for event_df in result_dict["daily_event_df_dict_with_score"].values(): + assert event_df.shape[1] == 4 + assert EVENT_DF_DATE_COL in event_df.columns + assert EVENT_DF_LABEL_COL in event_df.columns + + assert len(result_dict["daily_event_df_dict"]) == 4 + + # Runs again with a bigger bandwidth for clustering. + new_bandwidth_multiplier = 1 + + hg.group_holidays( + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score, + clustering_method="kde", + bandwidth_multiplier=new_bandwidth_multiplier + ) + assert hg.bandwidth_multiplier == new_bandwidth_multiplier + + # Checks results. + new_result_dict = hg.result_dict.copy() + + # New grouping has fewer groups due to a relaxed bandwidth. + assert len(new_result_dict["daily_event_df_dict"]) == 3 + + # Scoring results have not changed. + for key in [ + "score_result", + "score_result_avg" + ]: + assert new_result_dict[key] == result_dict[key] + + # Grouping results changed. 
+ for key in [ + "daily_event_df_dict_with_score", + "daily_event_df_dict", + "kde_cutoffs", + "kde_res", + "kde_plot" + ]: + with pytest.raises(AssertionError): + assert_equal(new_result_dict[key], result_dict[key]) + + # Runs holiday grouper with k-means clustering. + n_clusters = 5 + hg.group_holidays( + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score, + clustering_method="kmeans", + n_clusters=n_clusters, + include_diagnostics=True + ) + + # Checks the attributes are overridden. + assert hg.baseline_offsets == default_baseline_offsets + assert hg.use_relative_score == default_use_relative_score + assert hg.clustering_method == "kmeans" + assert hg.result_dict is not None + assert hg.n_clusters == n_clusters + assert hg.kmeans is not None + + # Checks results. + new_result_dict = hg.result_dict.copy() + expected_keys = [ + "holiday_inferrer", + "score_result_original", + "score_result_avg_original", + "score_result", + "score_result_avg", + "daily_event_df_dict_with_score", + "daily_event_df_dict", + "kmeans_diagnostics", + "kmeans_plot" + ] + for key in expected_keys: + assert new_result_dict[key] is not None + # Checks the number of groups matches the input. + assert len(new_result_dict["daily_event_df_dict"]) == n_clusters + + # Checks invalid clustering method. + # Tests unknown `clustering_method`. 
+ with pytest.raises(NotImplementedError, match="is not supported"): + hg.group_holidays( + min_n_days=min_n_days, + min_same_sign_ratio=min_same_sign_ratio, + min_abs_avg_score=min_abs_avg_score, + clustering_method="unknown", + bandwidth_multiplier=new_bandwidth_multiplier + ) diff --git a/greykite/tests/algo/common/test_holiday_inferrer.py b/greykite/tests/algo/common/test_holiday_inferrer.py index 9451bfa..0f28e7e 100644 --- a/greykite/tests/algo/common/test_holiday_inferrer.py +++ b/greykite/tests/algo/common/test_holiday_inferrer.py @@ -76,7 +76,7 @@ def test_infer_daily_data(daily_df): plot=True ) # Checks result. - assert len(result["scores"]) == 50 + assert len(result["scores"]) == 55 assert sorted(result["independent_holidays"]) == sorted([ ('US', 'Labor Day_+0'), ('US', 'Labor Day_-1'), @@ -98,37 +98,41 @@ def test_infer_daily_data(daily_df): ('US', 'Columbus Day_+0'), ('US', 'Memorial Day_+1'), ('US', 'Labor Day_+1'), - ('US', "New Year's Day_+0"), ('US', 'Martin Luther King Jr. Day_-1'), ('US', 'Independence Day_-2'), ('US', 'Christmas Day_-1'), ('US', 'Independence Day_-1'), + ('US', 'Martin Luther King Jr. Day_+1'), + ('US', 'Halloween_+0'), + ('US', 'Halloween_+2'), ]) assert sorted(result["together_holidays_positive"]) == sorted([ - ('US', 'Martin Luther King Jr. Day_+1'), ('US', 'Martin Luther King Jr. 
Day_+2'), ('US', 'Labor Day_+2'), ('US', 'Thanksgiving_+1'), ('US', 'Memorial Day_+2'), ('US', "New Year's Day_-1"), ('US', "New Year's Day_-2"), - ('US', "New Year's Day_+2") + ('US', "New Year's Day_+2"), + ('US', 'Halloween_-1'), + ('US', 'Halloween_-2'), ]) assert sorted(result["together_holidays_negative"]) == sorted([ - ('US', 'Independence Day_+1'), - ('US', 'Independence Day_+0'), + ('US', 'Christmas Day_+2'), + ('US', 'Christmas Day_-2'), ('US', 'Columbus Day_+1'), + ('US', 'Columbus Day_+2'), + ('US', 'Columbus Day_-2'), + ('US', 'Halloween_+1'), + ('US', 'Independence Day_+0'), + ('US', 'Independence Day_+1'), ('US', 'Martin Luther King Jr. Day_-2'), - ('US', 'Veterans Day_+2'), - ('US', 'Memorial Day_-2'), ('US', 'Memorial Day_-1'), - ('US', 'Columbus Day_-2'), - ('US', 'Columbus Day_+2'), - ('US', 'Christmas Day_-2'), - ('US', 'Christmas Day_+2'), - ('US', 'Independence Day_+2'), + ('US', 'Memorial Day_-2'), + ('US', "New Year's Day_+0"), + ('US', 'Veterans Day_+2') ]) assert len(result["fig"].data) == 6 @@ -140,9 +144,9 @@ def test_infer_daily_data(daily_df): assert hi.year_end == 2016 assert len(hi.ts) == len(hi.df) assert hi.country_holiday_df is not None - assert len(hi.holidays) == 10 - assert len(hi.score_result) == 50 - assert len(hi.score_result_avg) == 50 + assert len(hi.holidays) == 11 + assert len(hi.score_result) == 55 + assert len(hi.score_result_avg) == 55 assert hi.result == result @@ -160,14 +164,14 @@ def test_daily_data_diff_params(daily_df): together_holiday_thres=0.8 ) # Checks result. 
- assert len(result["scores"]) == 196 + assert len(result["scores"]) == 200 assert "US_Christmas Day_+0" in result["scores"] assert "US_New Year's Day_-3" in result["scores"] assert "India_All Saints Day_-1" in result["scores"] assert len(result["fig"].data) == 5 - assert len(result["independent_holidays"]) == 104 - assert len(result["together_holidays_positive"]) == 10 - assert len(result["together_holidays_negative"]) == 9 + assert len(result["independent_holidays"]) == 100 + assert len(result["together_holidays_positive"]) == 12 + assert len(result["together_holidays_negative"]) == 10 # Checks attributes. assert hi.df is not None assert hi.time_col == TIME_COL @@ -176,9 +180,9 @@ def test_daily_data_diff_params(daily_df): assert hi.year_end == 2016 assert len(hi.ts) == len(hi.df) assert hi.country_holiday_df is not None - assert len(hi.holidays) == 49 - assert len(hi.score_result) == 196 - assert len(hi.score_result_avg) == 196 + assert len(hi.holidays) == 50 + assert len(hi.score_result) == 200 + assert len(hi.score_result_avg) == 200 assert hi.result == result @@ -195,7 +199,7 @@ def test_sub_daily_data(): plot=True ) # Checks result. 
- assert len(result["scores"]) == 55 + assert len(result["scores"]) == 60 assert len(result["fig"].data) == 6 assert len(result["independent_holidays"]) == 7 assert len(result["together_holidays_negative"]) == 0 @@ -208,9 +212,9 @@ def test_sub_daily_data(): assert hi.year_end == 2020 assert len(hi.ts) == len(hi.df) assert hi.country_holiday_df is not None - assert len(hi.holidays) == 11 - assert len(hi.score_result) == 55 - assert len(hi.score_result_avg) == 55 + assert len(hi.holidays) == 12 + assert len(hi.score_result) == 60 + assert len(hi.score_result_avg) == 60 assert hi.result == result @@ -255,18 +259,31 @@ def test_get_scores(): "ts": pd.date_range("2020-12-01", freq="D", periods=40), "y": [1] * 22 + [2, 3, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2] + [1] * 6 }) - result = hi.infer_holidays(df=df) + result = hi.infer_holidays(df=df, use_relative_score=False) assert result["scores"]["US_Christmas Day_-2"] == [0.5] assert result["scores"]["US_Christmas Day_-1"] == [1.0] - assert result["scores"]["US_Christmas Day_+0"] == [1.5] + assert result["scores"]["US_Christmas Day_+0"] == [3.0] assert result["scores"]["US_Christmas Day_+1"] == [1.0] assert result["scores"]["US_Christmas Day_+2"] == [0.5] assert result["scores"]["US_New Year's Day_-2"] == [0.5] assert result["scores"]["US_New Year's Day_-1"] == [1.0] - assert result["scores"]["US_New Year's Day_+0"] == [1.5] + assert result["scores"]["US_New Year's Day_+0"] == [3.0] assert result["scores"]["US_New Year's Day_+1"] == [1.0] assert result["scores"]["US_New Year's Day_+2"] == [0.0] + # Tests the relative scores + result = hi.infer_holidays(df=df, use_relative_score=True) + assert result["scores"]["US_Christmas Day_-2"] == [1/3] + assert result["scores"]["US_Christmas Day_-1"] == [0.5] + assert result["scores"]["US_Christmas Day_+0"] == [3.0] + assert result["scores"]["US_Christmas Day_+1"] == [0.5] + assert result["scores"]["US_Christmas Day_+2"] == [1/3] + assert result["scores"]["US_New Year's Day_-2"] == [1/3] + 
assert result["scores"]["US_New Year's Day_-1"] == [0.5] + assert result["scores"]["US_New Year's Day_+0"] == [3.0] + assert result["scores"]["US_New Year's Day_+1"] == [0.5] + assert result["scores"]["US_New Year's Day_+2"] == [0.0] + def test_remove_observed(): """Tests the observed holidays are accurately renamed @@ -277,10 +294,10 @@ def test_remove_observed(): }) # 5 observed holidays are removed. country_holiday_df = get_holiday_df(country_list=["US"], years=[2020, 2021, 2022]) - assert len(country_holiday_df) == 39 + assert len(country_holiday_df) == 42 hi = HolidayInferrer() hi.infer_holidays(df=df) - assert len(hi.country_holiday_df) == 32 + assert len(hi.country_holiday_df) == 35 # Tests the correctness of observed holiday removal. # 2020-07-03 is the observed Independence Day for 2022-07-04. assert hi.country_holiday_df[ @@ -381,7 +398,7 @@ def test_get_daily_event_dict(daily_df): # Infers holidays and call. hi.infer_holidays(df=daily_df) daily_event_dict = hi.generate_daily_event_dict() - assert len(daily_event_dict) == 27 + assert len(daily_event_dict) == 29 assert "Holiday_positive_group" in daily_event_dict assert "Holiday_negative_group" in daily_event_dict # Every single holiday should cover 11 years. 
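Note on the suffix helpers tested in the next file: the assertions in `test_holiday_utils.py` fix the expected behavior of `get_dow_grouped_suffix` and `get_weekday_weekend_suffix` (Sunday 2023-01-01 maps to `_Sun` or `_WE`, Monday through Friday to `_WD`, Saturday to `_Sat` or `_WE`). The sketch below is one way to satisfy those assertions. It is not taken from `greykite.algo.common.holiday_utils`; the `*_sketch` function names are hypothetical and the real implementation may differ.

```python
import pandas as pd


def dow_grouped_suffix_sketch(date):
    """Hypothetical helper: "_WD" for Monday-Friday, "_Sat" / "_Sun" otherwise."""
    dow = pd.Timestamp(date).dayofweek  # Monday=0, ..., Sunday=6
    if dow == 5:
        return "_Sat"
    if dow == 6:
        return "_Sun"
    return "_WD"


def weekday_weekend_suffix_sketch(date):
    """Hypothetical helper: "_WE" for Saturday/Sunday, "_WD" otherwise."""
    dow = pd.Timestamp(date).dayofweek
    return "_WE" if dow >= 5 else "_WD"
```

Keeping Saturday and Sunday separate in the first helper and merged in the second mirrors the two groupings the tests distinguish.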
diff --git a/greykite/tests/algo/common/test_holiday_utils.py b/greykite/tests/algo/common/test_holiday_utils.py new file mode 100644 index 0000000..670beaf --- /dev/null +++ b/greykite/tests/algo/common/test_holiday_utils.py @@ -0,0 +1,52 @@ +import pandas as pd + +from greykite.algo.common.holiday_utils import get_dow_grouped_suffix +from greykite.algo.common.holiday_utils import get_weekday_weekend_suffix + + +def test_get_dow_grouped_suffix(): + """Tests `get_dow_grouped_suffix` function.""" + date = pd.to_datetime("2023-01-01") + assert get_dow_grouped_suffix(date) == "_Sun" + + date = pd.to_datetime("2023-01-02") + assert get_dow_grouped_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-03") + assert get_dow_grouped_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-04") + assert get_dow_grouped_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-05") + assert get_dow_grouped_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-06") + assert get_dow_grouped_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-07") + assert get_dow_grouped_suffix(date) == "_Sat" + + +def test_get_weekday_weekend_suffix(): + """Tests `get_weekday_weekend_suffix` function.""" + date = pd.to_datetime("2023-01-01") + assert get_weekday_weekend_suffix(date) == "_WE" + + date = pd.to_datetime("2023-01-02") + assert get_weekday_weekend_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-03") + assert get_weekday_weekend_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-04") + assert get_weekday_weekend_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-05") + assert get_weekday_weekend_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-06") + assert get_weekday_weekend_suffix(date) == "_WD" + + date = pd.to_datetime("2023-01-07") + assert get_weekday_weekend_suffix(date) == "_WE" diff --git a/greykite/tests/algo/common/test_l1_quantile_regression.py b/greykite/tests/algo/common/test_l1_quantile_regression.py index ceaa4e2..a2f4a8a 
100644 --- a/greykite/tests/algo/common/test_l1_quantile_regression.py +++ b/greykite/tests/algo/common/test_l1_quantile_regression.py @@ -106,6 +106,28 @@ def test_quantile_regression_fit_predict_l1(data): assert round(sum(pred > data["y"]) / len(data["y"]), 1) == 0.9 +def test_quantile_regression_fit_predict_mape(data): + """Tests fitting quantile regression fit and predict optimizing mape.""" + qr = QuantileRegression( + quantile=0.8, + alpha=0, + optimize_mape=True + ) + with LogCapture(LOGGER_NAME) as log_capture: + qr.fit(data["x"], data["y"]) + qr.predict(data["x"]) + assert qr.quantile == 0.5 + assert all(qr.sample_weight == 1 / np.abs(data["y"])) + log_capture.check_present(( + LOGGER_NAME, + "WARNING", + "The parameter 'optimize_mape' is set to 'True', " + "ignoring the input 'quantile' and 'sample_weight', " + "setting 'quantile' to 0.5 and 'sample_weight' to the inverse " + "absolute values of the response." + )) + + def test_errors(data): """Tests errors.""" # y is not a column vector. 
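Note on `optimize_mape`: the test above expects `quantile` to be reset to 0.5 and `sample_weight` to be `1 / np.abs(y)`. A likely reason is that the standard quantile (pinball) loss at quantile 0.5 is half the absolute error, so weighting each observation by `1 / |y|` makes the objective proportional to MAPE. The sketch below checks this identity numerically. It assumes the standard pinball loss and does not call the greykite `QuantileRegression` class; `pinball_loss` and `mape` are illustrative names.

```python
import numpy as np


def pinball_loss(y, pred, quantile, sample_weight):
    """Weighted quantile (pinball) loss, summed over observations."""
    resid = y - pred
    per_obs = np.where(resid >= 0, quantile * resid, (quantile - 1) * resid)
    return np.sum(sample_weight * per_obs)


def mape(y, pred):
    """Mean absolute percentage error, expressed as a fraction."""
    return np.mean(np.abs((y - pred) / y))


rng = np.random.default_rng(0)
y = rng.uniform(1.0, 10.0, size=50)
pred = y + rng.normal(0.0, 1.0, size=50)

# With quantile=0.5 and weights 1/|y|, the weighted loss equals n * MAPE / 2.
weighted_loss = pinball_loss(y, pred, quantile=0.5, sample_weight=1 / np.abs(y))
half_n_mape = 0.5 * len(y) * mape(y, pred)
```

Since the two quantities differ only by a constant factor, minimizing one minimizes the other, which is consistent with the warning text asserted in the test.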
diff --git a/greykite/tests/algo/common/test_ml_models.py b/greykite/tests/algo/common/test_ml_models.py index 4fe81a1..e686990 100644 --- a/greykite/tests/algo/common/test_ml_models.py +++ b/greykite/tests/algo/common/test_ml_models.py @@ -1,20 +1,30 @@ +import warnings + import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal +import scipy +from pandas.testing import assert_frame_equal +from patsy import dmatrices from greykite.algo.common.ml_models import breakdown_regression_based_prediction from greykite.algo.common.ml_models import design_mat_from_formula from greykite.algo.common.ml_models import fit_ml_model from greykite.algo.common.ml_models import fit_ml_model_with_evaluation from greykite.algo.common.ml_models import fit_model_via_design_matrix +from greykite.algo.common.ml_models import get_h_mat +from greykite.algo.common.ml_models import get_intercept_col_from_design_mat from greykite.algo.common.ml_models import predict_ml from greykite.algo.common.ml_models import predict_ml_with_uncertainty from greykite.algo.uncertainty.conditional.conf_interval import predict_ci from greykite.common.constants import ERR_STD_COL from greykite.common.constants import QUANTILE_SUMMARY_COL +from greykite.common.constants import TimeFeaturesEnum from greykite.common.evaluation import EvaluationMetricEnum from greykite.common.evaluation import calc_pred_err +from greykite.common.features.timeseries_features import build_time_features_df +from greykite.common.features.timeseries_features import fourier_series_multi_fcn +from greykite.common.features.timeseries_features import get_fourier_col_name from greykite.common.python_utils import assert_equal from greykite.common.testing_utils import gen_sliced_df @@ -60,12 +70,151 @@ def data_with_weights(): "model_formula_str": model_formula_str} +@pytest.fixture +def time_series_data(): + """Generates some time series data that is useful for testing ML models.
+ We do not only rely on functions in ``greykite.common.testing_utils`` + This function includes some operations which are not necessary in general: + e.g. (a) adding many random features (b) including a large number of fourier terms + in the features.""" + np.random.seed(1317) + + data_size = 600 + train_size = 500 + date_list = pd.date_range( + start="2010-01-01", + periods=data_size, + freq="D").tolist() + time_col = "ts" + df0 = pd.DataFrame({time_col: date_list}) + time_df = build_time_features_df( + dt=df0[time_col], + conti_year_origin=2010) + + df = pd.concat([df0, time_df], axis=1) + df["growth"] = 0.5 * (df[TimeFeaturesEnum.ct1.value] ** 1.05) + + # We generate a large number of Fourier terms (useful to see if regularization works) + func = fourier_series_multi_fcn( + col_names=[ + TimeFeaturesEnum.toy.value, + TimeFeaturesEnum.tow.value, + TimeFeaturesEnum.tod.value], + periods=[1.0, 7.0, 24.0], + orders=[50, 30, 30], + seas_names=None) + + res = func(df) + df_seas = res["df"] + df = pd.concat([df, df_seas], axis=1) + + fs_coefs = [-1, 3, 4] + intercept = 3.0 + noise_std = 0.1 + + df["y"] = abs( + intercept + + df["growth"] + + fs_coefs[0] * df[get_fourier_col_name(1, TimeFeaturesEnum.tod.value, function_name="sin")] + + fs_coefs[1] * df[get_fourier_col_name(1, TimeFeaturesEnum.tow.value, function_name="sin")] + + fs_coefs[2] * df[get_fourier_col_name(1, TimeFeaturesEnum.toy.value, function_name="sin")] + + noise_std * np.random.normal(size=df.shape[0])) + + # Adds 100 variables without predictive power (randomly generated) + for i in range(100): + df[f"x{i}"] = np.random.normal(size=len(df)) + + feature_cols = ["growth"] + list(df_seas.columns) + [f"x{i}" for i in range(100)] + + # Defines train and test sets + x_train = df[feature_cols][:train_size] + y_train = df["y"][:train_size] + + x_test = df[feature_cols][train_size:] + y_test = df["y"][train_size:] + + df_train = df[:train_size] + df_test = df[train_size:] + + return { + "df": df, + "df_train": 
df_train, + "df_test": df_test, + "x_train": x_train, + "y_train": y_train, + "x_test": x_test, + "y_test": y_test, + "feature_cols": feature_cols + } + + +def test_get_intercept_col_from_design_mat(): + """Tests getting explicit or implicit intercept column.""" + df = pd.DataFrame({ + "y": 1, + "a": ["a", "b", "c", "a"], + "b": ["d", "d", "e", "e"], + "c": 2 + }) + # With explicit intercept. + _, x = dmatrices( + "y~c+C(a, levels=['a', 'b', 'c'])+C(b, levels=['d', 'e'])+a:b+a:c", + data=df, + return_type="dataframe") + assert "Intercept" in x.columns + assert get_intercept_col_from_design_mat(x) == "Intercept" + + # With implicit intercept. + _, x = dmatrices( + "y~c+C(a, levels=['a', 'b', 'c'])+C(b, levels=['d', 'e'])+a:b+a:c+0", + data=df, + return_type="dataframe") + assert "Intercept" not in x.columns + assert get_intercept_col_from_design_mat(x) == "C(a, levels=['a', 'b', 'c'])[a]" + + # Without intercept. + _, x = dmatrices( + "y~c+0", + data=df, + return_type="dataframe") + assert "Intercept" not in x.columns + assert get_intercept_col_from_design_mat(x) is None + + def test_design_mat_from_formula(design_mat_info): """Tests design_mat_from_formula""" assert design_mat_info["x_mat"]["x1"][0] == 1 assert design_mat_info["y_col"] == "y" +def test_design_mat_from_formula_remove_intercept(): + """Tests `design_mat_from_formula` with removing intercept.""" + df = pd.DataFrame({ + "y": 1, + "a": ["a", "b", "c", "a"], + "b": ["d", "d", "e", "e"], + "c": 2 + }) + # With explicit intercept. + formula = "y~c+C(a, levels=['a', 'b', 'c'])+C(b, levels=['d', 'e'])+a:b+a:c" + result = design_mat_from_formula( + df=df, + model_formula_str=formula, + remove_intercept=True + ) + assert "Intercept" not in result["x_mat"].columns + + # With implicit intercept. 
+ formula = "y~c+C(a, levels=['a', 'b', 'c'])+C(b, levels=['d', 'e'])+a:b+a:c+0" + result = design_mat_from_formula( + df=df, + model_formula_str=formula, + remove_intercept=True + ) + assert "C(a, levels=['a', 'b', 'c'])[a]" not in result["x_mat"].columns + assert "C(a, levels=['a', 'b', 'c'])[b]" in result["x_mat"].columns + + def test_fit_model_via_design_matrix(design_mat_info): """Tests fit_model_via_design_matrix""" x_train = design_mat_info["x_mat"] @@ -121,7 +270,88 @@ def test_fit_model_via_design_matrix(design_mat_info): sample_weight=sample_weight) +def test_fit_model_via_design_matrix_various_algo(time_series_data): + """Tests ``fit_model_via_design_matrix`` with various algos. + This test is to ensure that the implemented algorithms have the expected + behavior. To that end, we check the performance of the algorithms in terms + of test error on simulated data.""" + x_train = time_series_data["x_train"] + y_train = time_series_data["y_train"] + x_test = time_series_data["x_test"] + y_test = time_series_data["y_test"] + # Small number of features to be used with unregularized / unstable algorithms + feature_cols_minimal = [ + "growth", + "sin1_toy", + "cos1_toy", + "sin2_toy", + "cos2_toy", + "sin1_tow", + "cos1_tow", + "sin2_tow", + "cos2_tow"] + + # We consider two cases: + # (a) algorithms which are stable (handle large number of features) + # (b) algorithms which are unstable (do not handle large number of features) + # For (a) we test with large number of features and for (b) a small number of features. + # Temporarily removes `lars` and `lasso_lars` since they have unstable performance + # under Linux and macOS.
+ fit_algorithms = [ + "rf", + "ridge", + "lasso", + # "lars", + "gradient_boosting", + # "lasso_lars", + "sgd", + "elastic_net" + ] + fit_algorithms_unstable = ["linear", "quantile_regression", "statsmodels_glm"] + + # Expected error for each algo (in terms of R2): + expected_r2_dict = { + "rf": 0.96, + "ridge": 0.92, + "lasso": 0.98, + # "lars": 0.97, + "gradient_boosting": 0.98, + # "lasso_lars": 0.98, + "sgd": 0.93, + "elastic_net": 0.97, + "linear": 0.95, + "quantile_regression": 0.97, + "statsmodels_glm": 0.9} + + # Case (a) + for fit_algorithm in fit_algorithms: + ml_model = fit_model_via_design_matrix( + x_train=x_train, # A large number of features appear in ``x_train`` + y_train=y_train, + fit_algorithm=fit_algorithm) + + y_test_pred = ml_model.predict(x_test) + + err = calc_pred_err(y_test, y_test_pred) + r2 = err[(EvaluationMetricEnum.Correlation.get_metric_name())] + assert r2 == pytest.approx(expected_r2_dict[fit_algorithm], rel=2e-2) + + # Case (b) + for fit_algorithm in fit_algorithms_unstable: + ml_model = fit_model_via_design_matrix( + x_train=x_train[feature_cols_minimal], # A small number of features only + y_train=y_train, + fit_algorithm=fit_algorithm) + + y_test_pred = ml_model.predict(x_test[feature_cols_minimal]) + + err = calc_pred_err(y_test, y_test_pred) + r2 = err[(EvaluationMetricEnum.Correlation.get_metric_name())] + assert r2 == pytest.approx(expected_r2_dict[fit_algorithm], rel=2e-2) + + def test_fit_model_via_design_matrix_with_weights(data_with_weights): + """Tests ``fit_model_via_design_matrix`` with weights.""" df = data_with_weights["df"] design_mat_info = data_with_weights["design_mat_info"] x_train = design_mat_info["x_mat"] @@ -154,9 +384,9 @@ def test_fit_model_via_design_matrix_with_weights(data_with_weights): # we expect to see two trends for y w.r.t x2 from plotly import graph_objects as go trace = go.Scatter( - x=df['x2'].values, - y=df['y'].values, - mode='markers') + x=df["x2"].values, + y=df["y"].values, + 
mode="markers") data = [trace] fig = go.Figure(data) fig.show() @@ -167,7 +397,7 @@ def test_fit_model_via_design_matrix_with_weights(data_with_weights): def test_fit_model_via_design_matrix_stats_models(): - """Testing the model fits via statsmodels module""" + """Tests the model fits via statsmodels module""" df = generate_test_data_for_fitting( n=50, seed=41, @@ -230,7 +460,7 @@ def test_fit_model_via_design_matrix2(design_mat_info): def test_fit_model_via_design_matrix3(design_mat_info): """Tests fit_model_via_design_matrix with - elastic_net fit_algorithm and fit_algorithm_params""" + "lasso_lars" fit_algorithm and fit_algorithm_params""" x_train = design_mat_info["x_mat"] y_train = design_mat_info["y"] @@ -346,6 +576,11 @@ def test_fit_ml_model(): "max_admissible_value", "normalize_df_func", "regression_weight_col", + "drop_intercept_col", + "alpha", + "h_mat", + "p_effective", + "sigma_scaler", "fitted_df"] assert (trained_model["y"] == df["y"]).all() @@ -434,6 +669,100 @@ def test_fit_ml_model(): ["x1", " 0.5335", " 0.409", " 1.304", " 0.192", " -0.269", " 1.336"]) +def test_fit_ml_model_various_algo(time_series_data): + """Tests ``fit_ml_model`` with various algos. + This test is to ensure that the implemented algorithms have the expected + behavior. To that end, we check the performance of the algorithms in terms + of test error on simulated data. + """ + df_train = time_series_data["df_train"] + df_test = time_series_data["df_test"] + y_test = time_series_data["y_test"] + feature_cols = time_series_data["feature_cols"] + + # We consider two cases: + # (a) algorithms which are stable (handle large number of features) + # (b) algorithms which are unstable (do not handle large number of features) + # For (a) we test with large number of features and for (b) a small number of features. + # Temporarily removes `lars` and `lasso_lars` since they have unstable performance + # under Linux and macOS.
+ fit_algorithms = [ + "rf", + "ridge", + "lasso", + # "lars", + "gradient_boosting", + # "lasso_lars", + "sgd", + "elastic_net"] + fit_algorithms_unstable = ["linear", "quantile_regression", "statsmodels_glm"] + + # In this case, we add some categorical variables with many levels + pred_cols = feature_cols + ["str_dow", "dom", "woy"] + model_formula_str = "y ~ " + "+".join(pred_cols) + # Small number of features for unstable algorithms + pred_cols_minimal = [ + "growth", + "sin1_toy", + "cos1_toy", + "str_dow"] + model_formula_minimal_str = "y ~ " + "+".join(pred_cols_minimal) + + # Expected error for each algo (in terms of R2) + expected_r2_dict = { + "rf": 0.97, + "ridge": 0.96, + "lasso": 0.98, + # "lars": 0.97, + "gradient_boosting": 0.98, + # "lasso_lars": 0.97, + "sgd": 0.97, + "elastic_net": 0.97, + "linear": 0.97, + "quantile_regression": 0.97, + "statsmodels_glm": 0.92} + + # Case (a) + for fit_algorithm in fit_algorithms: + trained_model = fit_ml_model( + df=df_train, + model_formula_str=model_formula_str, + fit_algorithm=fit_algorithm, + fit_algorithm_params=None, + y_col=None, + pred_cols=None) + + pred_res = predict_ml( + fut_df=df_test, + trained_model=trained_model) + + y_test_pred = pred_res["fut_df"]["y"] + + err = calc_pred_err(y_test, y_test_pred) + r2 = err[(EvaluationMetricEnum.Correlation.get_metric_name())] + assert r2 == pytest.approx(expected_r2_dict[fit_algorithm], rel=2e-2) + + # Case (b) + for fit_algorithm in fit_algorithms_unstable: + trained_model = fit_ml_model( + df=df_train, + model_formula_str=model_formula_minimal_str, + fit_algorithm=fit_algorithm, + fit_algorithm_params=None, + y_col=None, + pred_cols=None) + + pred_res = predict_ml( + fut_df=df_test, + trained_model=trained_model) + + y_test_pred = pred_res["fut_df"]["y"] + + err = calc_pred_err(y_test, y_test_pred) + r2 = err[(EvaluationMetricEnum.Correlation.get_metric_name())] + assert r2 == pytest.approx(expected_r2_dict[fit_algorithm], rel=2e-2) + + def 
test_fit_ml_model_normalization(): """Tests ``fit_ml_model`` with and without normalization""" @@ -625,7 +954,7 @@ def ci_width_and_coverage(conditional_cols, df, fut_df): ci_width_avg = ci_info["ci_width_avg"] assert round(ci_coverage, 1) == 94.7, ( "95 percent CI coverage is not as expected") - assert round(ci_width_avg, 1) == 22.4, ( + assert round(ci_width_avg, 1) == 22.6, ( "95 percent CI coverage average width is not as expected") # fitting heteroscedastic (with conditioning) uncertainty model @@ -637,9 +966,9 @@ def ci_width_and_coverage(conditional_cols, df, fut_df): ci_width_avg = ci_info["ci_width_avg"] # we observe better coverage is higher and ci width is narrower with # heteroscedastic model than before - assert round(ci_coverage, 1) == 96.3, ( + assert round(ci_coverage, 1) == 96.5, ( "95 percent CI coverage is not as expected") - assert round(ci_width_avg, 1) == 20.3, ( + assert round(ci_width_avg, 1) == 20.5, ( "95 percent CI coverage average width is not as expected") @@ -972,7 +1301,7 @@ def test_fit_ml_model_with_evaluation_sgd(): df=df, model_formula_str=model_formula_str, fit_algorithm="sgd", - fit_algorithm_params={"penalty": "none"}) + fit_algorithm_params={"penalty": None}) pred_res = predict_ml( fut_df=df_test, @@ -1012,18 +1341,227 @@ def test_fit_ml_model_with_evaluation_sgd(): assert err[enum.get_metric_name()] > 0.5 +def test_fit_ml_model_with_h_mat(): + """Tests the output of `fit_ml_model` and function `get_h_mat` for different scenarios.""" + def helper_test_h_mat(const_val, remove_intercept, fit_algorithm, normalize_method): + # Does not require a seed since tests only check the correctness / consistency. 
+ n_total = 150 + X = np.random.rand(n_total, 3) + X = np.concatenate([const_val * np.ones((n_total, 1)), X], axis=1) + beta = np.ones((X.shape[1], 1)) + y = X @ beta + 1 + np.random.normal(0, 1, n_total).reshape(-1, 1) + X_train = X[:100, :] + y_train = y[:100, :] + + df = pd.DataFrame(np.concatenate([y_train, X_train], axis=1), columns=["y", "const", "a", "b", "c"]) + model_formula_str = "y~const+a+b+c" + + uncertainty_dict = { + "uncertainty_method": "simple_conditional_residuals", + "params": { + "conditional_cols": [], + "quantiles": [0.025, 0.975], + "quantile_estimation_method": "normal_fit", + "sample_size_thresh": 5, + "small_sample_size_method": "std_quantiles", + "small_sample_size_quantile": 0.98}} + + remove_intercept = remove_intercept + model = fit_ml_model( + df=df, + model_formula_str=model_formula_str, + fit_algorithm=fit_algorithm, + fit_algorithm_params=None, + uncertainty_dict=uncertainty_dict, + normalize_method=normalize_method, + regression_weight_col=None, + remove_intercept=remove_intercept) + + ml_model = model["ml_model"] + # Prepares different versions of X matrix. + X_mat = np.array(model["x_mat"]) + X_centered = X_mat - X_mat.mean(axis=0) + y_centered = y_train - y.mean() + ci_model = model["uncertainty_model"] + L = ci_model["lu_d_sqrt"] + p_effective = model["p_effective"] + alpha = ml_model.alpha_ if fit_algorithm == "ridge" else 0 + # Calls `get_h_mat` with different X to compute H matrix. + H = get_h_mat(X_centered, alpha) if fit_algorithm == "ridge" else get_h_mat(X_mat, alpha) + + if fit_algorithm == "linear": + # Tests beta_hat. + expected_beta_hat = np.array(ml_model.params).reshape(-1, 1) # No additional intercept. + beta_hat = H @ y_train + assert_equal(X_mat @ expected_beta_hat, X_mat @ beta_hat) + + # Tests the decomposition of the H matrix. + assert np.linalg.norm(H @ H.T - L @ L.T) < 1e-8 + + # Tests `p_effective`. + assert_equal(p_effective, np.linalg.matrix_rank(X_mat)) + + # Tests the values in `ci_model`. 
+ assert ci_model["n_train"] is not None + assert ci_model["x_train_mean"] is None + assert (ci_model["pi_se_scaler"] >= 1).all() + + if fit_algorithm == "ridge": + # Tests beta hat. + # Using `y_centered` or `y_train` should give the same result, + # because `H @ (y_train - y_centered) = 0`. + assert_equal(ml_model.coef_, (H @ y_centered).reshape(-1)) + assert_equal(ml_model.coef_, (H @ y_train).reshape(-1)) + + # Tests intercept. + beta_hat = H @ y_centered + assert_equal(ml_model.intercept_, (y_train - X_mat @ beta_hat).mean()) + + # Tests the decomposition of the H matrix. + assert np.linalg.norm(H @ H.T - L @ L.T) < 1e-8 + + # Tests `p_effective`. + assert_equal(p_effective, np.trace(H @ X_centered) + 1) + + # Tests the values in `ci_model`. + assert ci_model["n_train"] is not None + assert ci_model["x_train_mean"] is not None + assert (ci_model["pi_se_scaler"] >= 1).all() + + for const_val in [0, 1, 2]: + for remove_intercept in [True, False]: + for fit_algorithm in ["linear", "ridge"]: + for normalize_method in ["zero_to_one", + "statistical", + "minus_half_to_half", + "zero_at_origin"]: + helper_test_h_mat(const_val, remove_intercept, fit_algorithm, normalize_method) + + +def test_p_effective(): + """Tests the computation of `p_effective` in `fit_ml_model` for "linear" and "ridge" models.""" + np.random.seed(123) + X = np.random.rand(5, 3) + n, p = X.shape + beta = np.array([1] * p).reshape((p, 1)) + y = X @ beta + + # Fits a linear regression. + model = fit_model_via_design_matrix( + x_train=X, + y_train=y, + fit_algorithm="linear", + fit_algorithm_params=None, + sample_weight=None) + alpha = 0 + XTX_alpha = X.T @ X + np.diag([alpha] * p) + p_effective = round(np.trace(scipy.linalg.pinvh(XTX_alpha) @ X.T @ X), 3) + + assert p == 3 + # The `df_model` attribute from `statsmodels` is inconsistent. + # The value is supposed to be rank minus 1, i.e. 2. + # We add this check so that we're aware of the inconsistency until it is fixed. 
+ assert model.df_model == 3 + assert np.linalg.matrix_rank(X) == 3 + assert p_effective == 3 + + # Duplicates the columns in `X`, result should not change. + X = np.concatenate([X, X], axis=1) + n, p = X.shape + beta = np.array([1] * p).reshape((p, 1)) + y = X @ beta + + model = fit_model_via_design_matrix( + x_train=X, + y_train=y, + fit_algorithm="linear", + fit_algorithm_params=None, + sample_weight=None) + alpha = 0 + XTX_alpha = X.T @ X + np.diag([alpha] * p) + p_effective = round(np.trace(scipy.linalg.pinvh(XTX_alpha) @ X.T @ X), 3) + + assert p == 6 + # The `df_model` attribute from `statsmodels` is inconsistent. + # The value is supposed to be rank minus 1, i.e. 2. + # We add this check so that we're aware of the inconsistency until it is fixed. + assert model.df_model == 3 + assert np.linalg.matrix_rank(X) == 3 + assert p_effective == 3 + + # Adds an intercept column to `X`, `p_effective` and rank should increase by 1. + X = np.concatenate([X, np.ones((5, 1))], axis=1) + n, p = X.shape + beta = np.array([1] * p).reshape((p, 1)) + y = X @ beta + + model = fit_model_via_design_matrix( + x_train=X, + y_train=y, + fit_algorithm="linear", + fit_algorithm_params=None, + sample_weight=None) + alpha = 0 + XTX_alpha = X.T @ X + np.diag([alpha] * p) + p_effective = round(np.trace(scipy.linalg.pinvh(XTX_alpha) @ X.T @ X), 3) + + assert p == 7 + assert model.df_model == 3 # This is the expected behavior. + assert np.linalg.matrix_rank(X) == 4 + assert p_effective == 4 + + # Fits a ridge regression. 
+ with warnings.catch_warnings(): + warnings.simplefilter("ignore") + model = fit_model_via_design_matrix( + x_train=X, + y_train=y, + fit_algorithm="ridge", + fit_algorithm_params=None, + sample_weight=None) + alpha = model.alpha_ + XTX_alpha = X.T @ X + np.diag([alpha] * p) + log_cond = np.log10(np.linalg.cond(XTX_alpha)) + digits_to_lose = 8 + if log_cond < digits_to_lose: + h_mat = scipy.linalg.solve(XTX_alpha, X.T, assume_a="pos") + else: + h_mat = scipy.linalg.pinvh(XTX_alpha) @ X.T + p_effective = round(np.trace(h_mat @ X), 3) + + assert round(log_cond, 1) == 6.1 + assert p_effective == 4.0 + + def test_dummy(): + """Tests a dummy dataset where the design matrix has perfectly correlated columns.""" df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 3, 1], "c": ["a", "b", "a"]}) df = pd.get_dummies(df) - df["y"] = [1, 5, 4] + df["y"] = [1, 6, 1] model_formula_str = "y~a+b+c_a+c_b" trained_model = fit_ml_model_with_evaluation( df=df, model_formula_str=model_formula_str, - training_fraction=1.0) - expected_coefs = np.array([0., 1., 1., -1., 1.]) - obtained_coefs = np.array(trained_model["ml_model"].coef_).round() - np.array_equal(expected_coefs, obtained_coefs) + training_fraction=1.0, + remove_intercept=False, + normalize_method=None + ) + expected_coefs = np.array([0, 1., 1., -1., 1.]) + obtained_coefs = np.array(trained_model["ml_model"].coef_).round(2) + # The fitted coefficients are not the same as expected, + # but the fitted values are equal to the actual y since design matrix is singular. 
+ assert not np.array_equal(expected_coefs, obtained_coefs) + X = np.array(pd.concat([ + pd.DataFrame({"intercept": [1, 1, 1]}), + df[["a", "b", "c_a", "c_b"]] + ], axis=1)) + y_fitted = X @ expected_coefs + assert np.array_equal(np.array(df["y"]), y_fitted.round(8)) + + n = df.shape[0] + p_effective = trained_model["p_effective"] + assert round(p_effective, 2) == 2 + assert trained_model["sigma_scaler"] == np.sqrt((n - 1) / (n - p_effective)) def test_fit_ml_model_with_evaluation_nan(): @@ -1045,6 +1583,9 @@ def test_fit_ml_model_with_evaluation_nan(): training_fraction=1.0) assert "The data frame included 1 row(s) with NAs which were removed for model fitting."\ in record[0].message.args[0] + # Since the design matrix is singular, variance scaling is skipped. + assert "Zero degrees of freedom" in record[1].message.args[0] + assert trained_model["sigma_scaler"] is None assert_equal(trained_model["y"], df["y"].loc[(0, 1, 3), ]) @@ -1132,7 +1673,7 @@ def test_fit_ml_model_with_evaluation_constant_column_sgd(): df=df, model_formula_str=model_formula_str, fit_algorithm=fit_algorithm, - fit_algorithm_params={"tol": 1e-5, "penalty": "none"}) + fit_algorithm_params={"tol": 1e-5, "penalty": None}) pred_res = predict_ml( fut_df=df_test, diff --git a/greykite/tests/algo/common/test_seasonality_inferrer.py b/greykite/tests/algo/common/test_seasonality_inferrer.py index 6542fb1..cb302e3 100644 --- a/greykite/tests/algo/common/test_seasonality_inferrer.py +++ b/greykite/tests/algo/common/test_seasonality_inferrer.py @@ -9,6 +9,7 @@ from greykite.common import constants as cst from greykite.common.testing_utils import assert_equal from greykite.common.testing_utils import generate_df_for_tests +from greykite.common.time_properties import infer_freq @pytest.fixture @@ -214,12 +215,12 @@ def test_adjust_trend(df): ) assert_equal( df_adj.loc[:4, "y"], - pd.Series([-3.6315, 2.8088, 3.6551, 0.5856, 0.7478], name="y"), + pd.Series([-3.4947, 2.9445, 3.7897, 0.7191, 0.8803], name="y"), 
rel=1e-3 ) assert_equal( df_adj.loc[:4, model.FITTED_TREND_COL], - pd.Series([-0.7820, -0.7752, -0.7683, -0.7614, -0.7545], name=model.FITTED_TREND_COL), + pd.Series([-0.9188, -0.9108, -0.9029, -0.8950, -0.8870], name=model.FITTED_TREND_COL), rel=1e-3 ) @@ -270,7 +271,7 @@ def test_process_df(df): adjust_trend_params=None, aggregation_period="W-SUN" ) - assert pd.infer_freq(df_adj[cst.TIME_COL]) == "W-SUN" + assert infer_freq(df_adj, cst.TIME_COL) == "W-SUN" def test_tolerance(df): diff --git a/greykite/tests/algo/forecast/silverkite/test_auto_config.py b/greykite/tests/algo/forecast/silverkite/test_auto_config.py index a63c239..8a5c3c8 100644 --- a/greykite/tests/algo/forecast/silverkite/test_auto_config.py +++ b/greykite/tests/algo/forecast/silverkite/test_auto_config.py @@ -85,7 +85,7 @@ def test_get_auto_holiday(df_daily): custom_event=custom_event ) ) - assert len(holidays) == 31 # Only United States is used. + assert len(holidays) == 34 # Only United States is used. assert holidays["custom_event"].equals(custom_event) assert "Holiday_positive_group" in holidays assert "Holiday_negative_group" in holidays diff --git a/greykite/tests/algo/forecast/silverkite/test_forecast_silverkite.py b/greykite/tests/algo/forecast/silverkite/test_forecast_silverkite.py index 2da8448..381bdfa 100644 --- a/greykite/tests/algo/forecast/silverkite/test_forecast_silverkite.py +++ b/greykite/tests/algo/forecast/silverkite/test_forecast_silverkite.py @@ -1,12 +1,14 @@ import datetime +import warnings from datetime import timedelta import matplotlib import numpy as np import pandas as pd import pytest +from pandas.testing import assert_frame_equal from pandas.tseries.frequencies import to_offset -from pandas.util.testing import assert_frame_equal +from sklearn.exceptions import ConvergenceWarning from testfixtures import LogCapture from greykite.algo.changepoint.adalasso.changepoint_detector import ChangepointDetector @@ -18,6 +20,9 @@ from greykite.common.constants import 
END_TIME_COL from greykite.common.constants import ERR_STD_COL from greykite.common.constants import EVENT_DF_LABEL_COL +from greykite.common.constants import IS_EVENT_ADJACENT_COL +from greykite.common.constants import IS_EVENT_COL +from greykite.common.constants import IS_EVENT_EXACT_COL from greykite.common.constants import LOGGER_NAME from greykite.common.constants import QUANTILE_SUMMARY_COL from greykite.common.constants import START_TIME_COL @@ -37,6 +42,7 @@ from greykite.common.features.timeseries_impute import impute_with_lags from greykite.common.features.timeseries_lags import build_autoreg_df from greykite.common.features.timeseries_lags import build_autoreg_df_multi +from greykite.common.gen_moving_timeseries_forecast import gen_moving_timeseries_forecast from greykite.common.python_utils import assert_equal from greykite.common.python_utils import get_pattern_cols from greykite.common.testing_utils import generate_anomalous_data @@ -48,7 +54,7 @@ matplotlib.use("agg") # noqa: E402 -import matplotlib.pyplot as plt # isort:skip +import matplotlib.pyplot as plt # isort:skip # noqa: E402 @pytest.fixture @@ -61,6 +67,30 @@ def hourly_data(): conti_year_origin=2018) +@pytest.fixture +def real_data(): + """Loads and prepares some real data sets for performance testing.""" + dl = DataLoader() + df_pt = dl.load_peyton_manning() + df_hourly_bk = dl.load_bikesharing() + # This adds a small number to avoid zeros in MAPE calculation + df_hourly_bk["count"] += 1 + + agg_func = {"count": "sum", "tmin": "mean", "tmax": "mean", "pn": "mean"} + df_bk = dl.load_bikesharing(agg_freq="daily", agg_func=agg_func) + + # This adds a small number to avoid zeros in MAPE calculation + df_bk["count"] += 10 + # Drops last value as data might be incorrect since the original data is hourly + df_bk.drop(df_bk.tail(1).index, inplace=True) + df_bk.reset_index(drop=True, inplace=True) + + return { + "daily_pt": df_pt, + "hourly_bk": df_hourly_bk, + "daily_bk": df_bk} + + 
@pytest.fixture def lagged_regressor_dict(): """Generate a dictionary of 3 lagged regressors with different dtypes""" @@ -231,6 +261,48 @@ def test_forecast_silverkite_hourly(hourly_data): """ +def test_forecast_silverkite_hourly_with_dst(hourly_data): + """Tests silverkite on hourly data with daylight saving variables""" + train_df = hourly_data["train_df"] + test_df = hourly_data["test_df"] + fut_time_num = hourly_data["fut_time_num"] + + silverkite = SilverkiteForecast() + trained_model = silverkite.forecast( + df=train_df, + time_col=TIME_COL, + value_col=VALUE_COL, + train_test_thresh=datetime.datetime(2019, 6, 1), + origin_for_time_vars=None, + fs_components_df=pd.DataFrame({ + "name": ["tod", "tow", "conti_year"], + "period": [24.0, 7.0, 1.0], + "order": [3, 0, 5]}), + extra_pred_cols=["ct_sqrt", "dow_hr", "ct1", "us_dst*dow_hr"], + normalize_method="zero_to_one") + + assert "us_dst*dow_hr" in trained_model["pred_cols"] + + fut_df = silverkite.predict_n_no_sim( + fut_time_num=fut_time_num, + trained_model=trained_model, + freq="H", + new_external_regressor_df=None)["fut_df"] + + err = calc_pred_err(test_df[VALUE_COL], fut_df[VALUE_COL]) + enum = EvaluationMetricEnum.Correlation + assert err[enum.get_metric_name()] > 0.3 + enum = EvaluationMetricEnum.RootMeanSquaredError + assert err[enum.get_metric_name()] < 6.0 + assert trained_model["x_mat"]["ct1"][0] == 0 # this should be true when origin_for_time_vars=None + """ + plt_comparison_forecast_vs_observed( + fut_df=fut_df, + test_df=test_df, + file_name=None) + """ + + def test_forecast_silverkite_pred_cols(hourly_data): """Tests silverkite on hourly data with varying predictor set ups. 
In particular we test ``drop_pred_cols``, ``admitted_pred_cols``""" @@ -741,12 +813,11 @@ def test_forecast_silverkite_freq(): check_like=True) -def test_forecast_silverkite_changepoints(): +def test_forecast_silverkite_changepoints(real_data): """Tests forecast_silverkite on peyton manning data (with changepoints and missing values) """ - dl = DataLoader() - df_pt = dl.load_peyton_manning() + df_pt = real_data["daily_pt"] silverkite = SilverkiteForecast() trained_model = silverkite.forecast( @@ -793,10 +864,9 @@ def test_forecast_silverkite_changepoints(): assert len(changepoint_values) == len(changepoint_dates) -def test_forecast_silverkite_seasonality_changepoints(): +def test_forecast_silverkite_seasonality_changepoints(real_data): # test forecast_silverkite on peyton manning data - dl = DataLoader() - df_pt = dl.load_peyton_manning() + df_pt = real_data["daily_pt"] silverkite = SilverkiteForecast() # seasonality changepoints is None if dictionary is not provided trained_model = silverkite.forecast( @@ -1606,8 +1676,8 @@ def test_forecast_silverkite_2min_with_uncertainty(): axis=1) ci_coverage = 100.0 * fut_df["inside_95_ci"].mean() - assert round(ci_coverage) == 91, ( - "95 percent CI coverage is not as expected (91%)") + assert round(ci_coverage) == 93, ( + "95 percent CI coverage is not as expected (93%)") err = calc_pred_err(test_df[VALUE_COL], fut_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation @@ -1654,6 +1724,7 @@ def test_forecast_silverkite_simulator(): past_df = train_df[[TIME_COL, VALUE_COL]].copy() # simulations with error + np.random.seed(123) sim_df = silverkite.simulate( fut_df=fut_df, trained_model=trained_model, @@ -1661,13 +1732,12 @@ def test_forecast_silverkite_simulator(): new_external_regressor_df=None, include_err=True)["sim_df"] - np.random.seed(123) assert sim_df[VALUE_COL].dtype == "float64" err = calc_pred_err(test_df[VALUE_COL], sim_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation - assert 
round(err[enum.get_metric_name()], 2) == 0.97 + assert round(err[enum.get_metric_name()], 2) == 0.90 enum = EvaluationMetricEnum.RootMeanSquaredError - assert round(err[enum.get_metric_name()], 2) == 0.55 + assert round(err[enum.get_metric_name()], 2) == 1.05 # simulations without errors sim_df = silverkite.simulate( @@ -1677,7 +1747,6 @@ def test_forecast_silverkite_simulator(): new_external_regressor_df=None, include_err=False)["sim_df"] - np.random.seed(123) assert sim_df[VALUE_COL].dtype == "float64" err = calc_pred_err(test_df[VALUE_COL], sim_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation @@ -1834,6 +1903,7 @@ def test_forecast_silverkite_predict_via_sim(): # Predicts with the original sim # import time # t0 = time.time() + np.random.seed(1) # Result is highly sensitive to seed due to propagating errors in AR terms. pred_df = silverkite.predict_via_sim( fut_df=fut_df, trained_model=trained_model, @@ -1850,9 +1920,9 @@ def test_forecast_silverkite_predict_via_sim(): ERR_STD_COL] err = calc_pred_err(test_df[VALUE_COL], pred_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation - assert round(err[enum.get_metric_name()], 2) == 0.80 + assert round(err[enum.get_metric_name()], 2) == 0.76 enum = EvaluationMetricEnum.RootMeanSquaredError - assert round(err[enum.get_metric_name()], 2) == 0.14 + assert round(err[enum.get_metric_name()], 2) == 0.16 """ import os import plotly @@ -1863,8 +1933,7 @@ def test_forecast_silverkite_predict_via_sim(): plotly.offline.plot(fig, filename=html_file_name) """ - # Predicts via fast simulation method - np.random.seed(123) + # Predicts via fast simulation method (no seed is needed) # t0 = time.time() pred_df = silverkite.predict_via_sim_fast( fut_df=fut_df, @@ -2003,7 +2072,7 @@ def test_forecast_silverkite_predict_via_sim2(): enum = EvaluationMetricEnum.RootMeanSquaredError rmse = err[enum.get_metric_name()] err = 100 * (rmse / abs_y_mean) - assert round(err) == 15 + assert round(err) == 13 # Checks if simulation is done 
correctly # It manually calculates the predictions using lags and @@ -2193,7 +2262,7 @@ def test_forecast_silverkite_predict_via_sim3(): enum = EvaluationMetricEnum.RootMeanSquaredError rmse = err[enum.get_metric_name()] err = 100 * (rmse / abs_y_mean) - assert round(err, 2) == 0.58 + assert round(err, 2) == 0.57 """ import os import plotly @@ -2359,6 +2428,214 @@ def test_silverkite_predict(): assert list(predict_info["fut_df"].columns) == expected_fut_df_cols +def test_silverkite_various_algos(real_data): + """A test function to benchmark the performance of various algorithms in core silverkite. + This test is also useful for adding new functionality and algorithms and + includes a low level bench-marking method. + It does not require more advanced machinery available at other layers e.g. pipeline. + This test is not intended to replace comprehensive bench-marking but + it is included to do a quick performance test on the core level.""" + autoreg_coefs = [0.5] * 15 + periods = 600 + df_pt = real_data["daily_pt"] + df_bk = real_data["daily_bk"] + df_bk = df_bk[["ts", "count"]] + df_bk.columns = ["ts", "y"] + np.random.seed(1317) + data = generate_df_for_tests( + freq="D", + periods=periods + len(autoreg_coefs), # Generates more data to avoid missing at the end + train_frac=0.8, + train_end_date=None, + noise_std=0.5, + remove_extra_cols=True, + autoreg_coefs=autoreg_coefs, + fs_coefs=[0.1, 1, 0.1], + growth_coef=2.0, + intercept=10.0) + + df = data["df"][:periods] # Removes last few values which are Nan due to using of lags in the data generation + df_dict = {} + df_dict["daily_pt"] = df_pt[:periods] + df_dict["daily_bk"] = df_bk[:periods] + df_dict["sim"] = df + """ + # Commented out code to inspect the raw series. 
+ import os + home = os.path.expanduser("~") + from greykite.common.viz.timeseries_annotate import plot_lines_markers + + for label, df in df_dict.items(): + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["y"], + marker_cols=None, + line_colors=None, + marker_colors=None) + + fig.write_html(f"{home}/raw_series_{label}.html", auto_open=True) + """ + + def train_forecast_fcn( + fit_algorithm, + changepoints_dict): + """It constructs a function which fits data and then forecasts for the + given parameters. The constructed function will be an input to + ``~greykite.common.gen_moving_timeseries_forecast.gen_moving_timeseries_forecast`` + For details about the expected inputs and outputs of the generated function see + the description for the input argument: ``train_forecast_func`` of ``gen_moving_timeseries_forecast``""" + + def train_forecast_func( + df, + time_col, + value_col, + forecast_horizon, + new_external_regressor_df=None): + """A function which fits a forecast model and then predicts for given + forecast horizon.
This function will be a proper input to + ``~greykite.common.gen_moving_timeseries_forecast.gen_moving_timeseries_forecast`` + For description of the inputs and output of this function refer to ``gen_moving_timeseries_forecast``""" + autoreg_dict = { + "lag_dict": {"orders": list(range(7, 14))}, + "agg_lag_dict": None, + "series_na_fill_func": lambda s: s.bfill().ffill()} + + countries = ["US", "India"] + event_df_dict = get_holidays( + countries, + year_start=2000, + year_end=2025) + + silverkite = SilverkiteForecast() + trained_model = silverkite.forecast( + df=df, + time_col=TIME_COL, + value_col=VALUE_COL, + train_test_thresh=None, + origin_for_time_vars=None, + fit_algorithm=fit_algorithm, + fs_components_df=pd.DataFrame({ + "name": ["tow", "conti_year"], + "period": [7.0, 1.0], + "order": [7, 15]}), + extra_pred_cols=["ct1"], + daily_event_df_dict=event_df_dict, + changepoints_dict=changepoints_dict, + seasonality_changepoints_dict=None, + uncertainty_dict=None, + autoreg_dict=autoreg_dict, + fast_simulation=True) + + # Prediction phase + np.random.seed(123) + predict_info = silverkite.predict_n( + fut_time_num=forecast_horizon, + trained_model=trained_model, + past_df=None, + new_external_regressor_df=None, + include_err=None, + force_no_sim=False) + + fut_df = predict_info["fut_df"] + + return { + "fut_df": fut_df + } + + return train_forecast_func + + def cross_validate_all_models(changepoints_dict): + """Runs cross validation for all algo, data pairs. It returns a dictionary + which contains the error for each model / data pair. + + The input is ``changepoints_dict`` which is passed to ``SilverkiteForecast`` + class's ``forecast`` function. See + ~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast + for more details. + + The output is a dictionary with: + - keys being f"{data_label}_{fit_algorithm}" + - values being tuples (R2, MAPE). 
+ + """ + + def cross_validate_one_model(data_label, fit_algorithm, changepoints_dict): + """Runs cross validation for one algo, data pair.""" + train_forecast_func = train_forecast_fcn( + fit_algorithm=fit_algorithm, + changepoints_dict=changepoints_dict) + + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=ConvergenceWarning) + num_tests = 8 + train_move_ahead = 13 + min_training_end_point = len(df) - (num_tests * train_move_ahead) + test_res = gen_moving_timeseries_forecast( + df=df_dict[data_label].copy(), + time_col="ts", + value_col="y", + train_forecast_func=train_forecast_func, + train_move_ahead=train_move_ahead, + forecast_horizon=3, + min_training_end_point=min_training_end_point) + + compare_df = test_res["compare_df"] + validation_num = test_res["validation_num"] + assert round(validation_num) == num_tests + + err = calc_pred_err(compare_df["y_true"], compare_df["y_hat"]) + enum = EvaluationMetricEnum.Correlation + r2 = err[enum.get_metric_name()] + enum = EvaluationMetricEnum.MeanAbsolutePercentError + mape = err[enum.get_metric_name()] + err_dict[f"{data_label}_{fit_algorithm}"] = (round(r2, 2), round(mape, 2)) + return None + + err_dict = {} + algos = ["ridge", "elastic_net"] + + for data_label in df_dict.keys(): + for fit_algorithm in algos: + cross_validate_one_model( + data_label, + fit_algorithm, + changepoints_dict) + + return err_dict + + # The case without changepoints + err_dict = cross_validate_all_models( + changepoints_dict=None) + + assert err_dict == { + "daily_pt_ridge": (0.86, 2.97), + "daily_pt_elastic_net": (0.89, 2.8), + "daily_bk_ridge": (0.74, 25.68), + "daily_bk_elastic_net": (0.75, 25.94), + "sim_ridge": (0.93, 0.8), + "sim_elastic_net": (0.94, 0.85)} + + # The case with change points + changepoints_dict0 = { + "method": "auto", + "yearly_seasonality_order": 15, + "resample_freq": "2D", + "actual_changepoint_min_distance": "50D", + "potential_changepoint_distance": "10D", + 
"no_changepoint_proportion_from_end": 0.1} + err_dict_cp = cross_validate_all_models( + changepoints_dict=changepoints_dict0) + + assert err_dict_cp == { + "daily_pt_ridge": (0.82, 3.36), + "daily_pt_elastic_net": (0.89, 2.8), + "daily_bk_ridge": (0.74, 25.7), + "daily_bk_elastic_net": (0.75, 25.94), + "sim_ridge": (0.93, 0.8), + "sim_elastic_net": (0.94, 0.85)} + + def test_predict_silverkite_with_regressors(): """Testing ``predict_silverkite`` in presence of regressors""" data = generate_df_with_reg_for_tests( @@ -3093,7 +3370,7 @@ def fit_silverkite(autoreg_dict): err = calc_pred_err(test_df[VALUE_COL], fut_df_with_ar[VALUE_COL]) enum = EvaluationMetricEnum.RootMeanSquaredError - assert err[enum.get_metric_name()] == pytest.approx(85.896, rel=1e-2) + assert err[enum.get_metric_name()] == pytest.approx(113.482, rel=1e-2) err = calc_pred_err(test_df[VALUE_COL], fut_df_no_ar[VALUE_COL]) enum = EvaluationMetricEnum.RootMeanSquaredError @@ -3234,9 +3511,9 @@ def test_forecast_silverkite_simulator_regressor(): assert sim_df[VALUE_COL].dtype == "float64" err = calc_pred_err(test_df[VALUE_COL], sim_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation - assert round(err[enum.get_metric_name()], 2) == 0.56 + assert round(err[enum.get_metric_name()], 2) == 0.45 enum = EvaluationMetricEnum.RootMeanSquaredError - assert round(err[enum.get_metric_name()], 2) == 2.83 + assert round(err[enum.get_metric_name()], 2) == 3.65 # predict via sim np.random.seed(123) @@ -3257,9 +3534,9 @@ def test_forecast_silverkite_simulator_regressor(): assert sim_df[VALUE_COL].dtype == "float64" err = calc_pred_err(test_df[VALUE_COL], fut_df[VALUE_COL]) enum = EvaluationMetricEnum.Correlation - assert round(err[enum.get_metric_name()], 2) == 0.65 + assert round(err[enum.get_metric_name()], 2) == 0.60 enum = EvaluationMetricEnum.RootMeanSquaredError - assert round(err[enum.get_metric_name()], 2) == 2.35 + assert round(err[enum.get_metric_name()], 2) == 2.51 """ plt_comparison_forecast_vs_observed( 
@@ -3310,7 +3587,8 @@ def test_forecast_silverkite_with_holidays_hourly(): "seas_names": ["daily", "weekly", None]}), extra_pred_cols=["ct_sqrt", "dow_hr", f"events_US*{fourier_col1}", f"events_US*{fourier_col2}", - f"events_US*{fourier_col3}"], + f"events_US*{fourier_col3}", + IS_EVENT_COL, IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL], daily_event_df_dict=event_df_dict) fut_df = silverkite.predict_n_no_sim( @@ -3330,6 +3608,19 @@ def test_forecast_silverkite_with_holidays_hourly(): file_name=None) """ + # Checks that event indicators are in the feature matrix + assert set( + [IS_EVENT_COL, IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL] + ).issubset(set(trained_model["x_mat"].columns)) + # Checks that event indicators have correct values + num_is_event = trained_model["x_mat"][IS_EVENT_COL].sum() + num_is_event_exact = trained_model["x_mat"][IS_EVENT_EXACT_COL].sum() + num_is_event_adjacent = trained_model["x_mat"][IS_EVENT_ADJACENT_COL].sum() + assert num_is_event == num_is_event_exact + assert num_is_event_exact > 0 + assert num_is_event_adjacent == 0 + assert num_is_event == num_is_event_exact + num_is_event_adjacent + def test_forecast_silverkite_with_holidays_effect(): """Tests silverkite, modeling a separate effect per holiday @@ -3357,8 +3648,8 @@ def test_forecast_silverkite_with_holidays_effect(): holidays_to_model_separately=holidays_to_model_separately, year_start=2015, year_end=2025, - pre_num=0, - post_num=0) + pre_num=1, + post_num=1) # constant event effect at daily level event_cols = [f"Q('events_{key}')" for key in event_df_dict.keys()] @@ -3369,7 +3660,8 @@ def test_forecast_silverkite_with_holidays_effect(): fs_name="tod", fs_order=3, fs_seas_name="daily") - extra_pred_cols = ["ct_sqrt", "dow_hr"] + event_cols + interaction_cols + extra_pred_cols = ["ct_sqrt", "dow_hr"] + event_cols + interaction_cols + \ + [IS_EVENT_COL, IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL, f"{IS_EVENT_COL}:is_weekend"] silverkite = SilverkiteForecast() trained_model = 
silverkite.forecast( df=train_df, @@ -3402,6 +3694,39 @@ def test_forecast_silverkite_with_holidays_effect(): file_name=None) """ + # Checks that event indicators are in the feature matrix + assert set( + [IS_EVENT_COL, IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL, f"{IS_EVENT_COL}:is_weekend[T.True]"] + ).issubset(set(trained_model["x_mat"].columns)) + # Checks that event indicators have correct values + num_is_event = trained_model["x_mat"][IS_EVENT_COL].sum() + num_is_event_exact = trained_model["x_mat"][IS_EVENT_EXACT_COL].sum() + num_is_event_adjacent = trained_model["x_mat"][IS_EVENT_ADJACENT_COL].sum() + assert num_is_event > num_is_event_exact # is_event is a union of is_event_exact and is_event_adjacent + assert num_is_event_exact > 0 + assert num_is_event_adjacent > 0 + assert num_is_event <= num_is_event_exact + num_is_event_adjacent # event and its neighboring days may overlap + + # Event indicators should not be included if not specified + extra_pred_cols = ["ct_sqrt", "dow_hr"] + event_cols + interaction_cols + silverkite = SilverkiteForecast() + trained_model = silverkite.forecast( + df=train_df, + time_col="ts", + value_col=VALUE_COL, + train_test_thresh=datetime.datetime(2019, 6, 1), + origin_for_time_vars=2018, + fs_components_df=pd.DataFrame({ + "name": ["tod", "tow", "conti_year"], + "period": [24.0, 7.0, 1.0], + "order": [3, 0, 5], + "seas_names": ["daily", "weekly", None]}), + extra_pred_cols=extra_pred_cols, + daily_event_df_dict=event_df_dict) + assert set( + [IS_EVENT_COL, IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL] + ).intersection(set(trained_model["x_mat"].columns)) == set() + def test_forecast_silverkite_train_test_thresh_error(hourly_data): df = hourly_data["df"] @@ -3477,7 +3802,7 @@ def test_forecast_silverkite_with_adjust_anomalous(): "start_time_col": START_TIME_COL, "end_time_col": END_TIME_COL, "adjustment_delta_col": ADJUSTMENT_DELTA_COL, - "filter_by_dict": {"platform": "MOBILE"}}}) + "filter_by_dict": {"dimension1": "level_1"}}}) 
adj_df_info = trained_model["adjust_anomalous_info"] @@ -5016,3 +5341,168 @@ def test_past_df_sufficient_warning_for_monthly_data(hourly_data): "DEBUG", "``past_df`` is not sufficient, imputation is performed when creating autoregression terms." ) not in log_capture.actual() + + +def test_min_admissible_value_with_simulations(): + """Tests ``min_admissible_value`` and ``max_admissible_value`` in `simulate` to check its basic functionality; + Then tests in `predict_via_sim` to check if the final results are all bounded. + """ + + # Generates data and model + train_len = 500 + test_len = 10 + data_len = train_len + test_len + np.random.seed(179) + ts = pd.date_range(start="1/1/2018", periods=data_len, freq="D") + z = np.random.randint(low=-50, high=50, size=data_len) + y = [0]*data_len + y[0] = 600 + y[1] = 600 + y[2] = 600 + y[3] = 600 + y[4] = 600 + + # Explicitly defines auto-regressive structure + for i in range(4, data_len): + y[i] = round(0.5*y[i-1] + 0.5*y[i-2] + z[i]) + + df = pd.DataFrame({ + "ts": ts, + "y": y}) + df["y"] = df["y"].map(float) + df["ts"] = pd.to_datetime(df["ts"]) + + train_df = df[:(train_len)].reset_index(drop=True) + test_df = df[(train_len):].reset_index(drop=True) + fut_df = test_df.copy() + fut_df[VALUE_COL] = None + + silverkite = SilverkiteForecast() + autoreg_dict = { + "lag_dict": {"orders": list(range(1, 3))}, + "agg_lag_dict": None, + "series_na_fill_func": lambda s: s.bfill().ffill()} + + trained_model = silverkite.forecast( + df=train_df, + time_col=TIME_COL, + value_col=VALUE_COL, + train_test_thresh=None, + origin_for_time_vars=None, + fs_components_df=None, + extra_pred_cols=None, + drop_pred_cols=["ct1"], + autoreg_dict=autoreg_dict, + uncertainty_dict={ + "uncertainty_method": "simple_conditional_residuals", + "params": { + "conditional_cols": ["dow"], + "quantiles": [0.025, 0.975], + "quantile_estimation_method": "normal_fit", + "sample_size_thresh": 20, + "small_sample_size_method": "std_quantiles", + 
"small_sample_size_quantile": 0.98}}) + + past_df = train_df[[TIME_COL, VALUE_COL]].copy() + + # Tests with function ``simulate`` to check the expected functionality with ``min_admissible_value`` + trained_model["min_admissible_value"] = None + trained_model["max_admissible_value"] = None + + np.random.seed(21) + sim_df = silverkite.simulate( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None, + include_err=True)["sim_df"] + # Asserts the predicted values when ``min_admissible_value`` and ``max_admissible_value`` are set to None + assert np.array_equal(np.round(sim_df[VALUE_COL]), [-10, -21, 18, -34, 18, -51, -16, -37, 10, -12]) + + # Tests with ``min_admissible_value`` + trained_model["min_admissible_value"] = -30 + trained_model["max_admissible_value"] = None + + np.random.seed(21) + sim_df = silverkite.simulate( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None, + include_err=True)["sim_df"] + # Values below are clipped down to -30 as a result. + assert np.array_equal(np.round(sim_df[VALUE_COL]), [-10, -21, 18, -30, 20, -30, -5, -22, 23, 2]) + + # Tests with ``max_admissible_value`` + trained_model["min_admissible_value"] = None + trained_model["max_admissible_value"] = 0 + + np.random.seed(21) + sim_df = silverkite.simulate( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None, + include_err=True)["sim_df"] + + # Values below are clipped up to 0 as a result. 
+ assert np.array_equal(np.round(sim_df[VALUE_COL]), [-10, -21, 0, -42, 0, -63, -31, -51, -4, -26]) + + # Tests with the final predictions after multiple simulations (with ``predict_via_sim``) + trained_model["min_admissible_value"] = None + trained_model["max_admissible_value"] = None + + np.random.seed(10) + pred_df = silverkite.predict_via_sim( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None, + simulation_num=5)["fut_df"] + assert np.array_equal(np.round(pred_df[VALUE_COL]), [-8, -21, -28, -8, -1, 6, 2, 17, 29, 17]) + + # Re-sets the ``min_admissible_value`` and ``max_admissible_value`` in such a way that some of the above predictions + # are outside the interval ``(min_admissible_value, max_admissible_value)`` + trained_model["min_admissible_value"] = -15 + trained_model["max_admissible_value"] = 10 + + # Makes a new prediction with ``min_admissible_value`` and ``max_admissible_value`` + np.random.seed(10) + pred_df = silverkite.predict_via_sim( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None, + simulation_num=5)["fut_df"] + + # Predicted values are all between ``min_admissible_value`` and ``max_admissible_value``. Notice that the first prediction + # -8 changed to -1 though it was within the boundary. The reason was that the predicted values from ``predict_via_sim`` took + # an average of simulated predictions, and these simulated values were clipped (in ``simulate``). + assert np.array_equal(np.round(pred_df[VALUE_COL]), [-1, -5, -11, -2, 4, 0, 0, 4, 8, -2]) + + # Tests with ``predict_via_sim_fast`` (where ``include_err=False``) to verify that without error added, the forecasted + # value should not change until it exceeds the boundary for the first time. 
+ trained_model["min_admissible_value"] = None + trained_model["max_admissible_value"] = None + + pred_df = silverkite.predict_via_sim_fast( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None)["fut_df"] + assert np.array_equal(np.round(pred_df[VALUE_COL]), [-9, -18, -9, -10, -6, -5, -2, -1, 2, 4]) + + # Re-sets the ``min_admissible_value`` and ``max_admissible_value`` in such a way that some of the above predictions + # are outside the interval ``(min_admissible_value, max_admissible_value)`` + trained_model["min_admissible_value"] = -18 + trained_model["max_admissible_value"] = -5 + + # Makes a new prediction with ``min_admissible_value`` and ``max_admissible_value`` + pred_df = silverkite.predict_via_sim_fast( + fut_df=fut_df, + trained_model=trained_model, + past_df=past_df, + new_external_regressor_df=None)["fut_df"] + + # As expected, values only start to change when they first exceed the boundary + assert np.array_equal(np.round(pred_df[VALUE_COL]), [-9, -18, -9, -10, -6, -5, -5, -5, -5, -5]) diff --git a/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite.py b/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite.py index 2af1189..cfca1b4 100644 --- a/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite.py +++ b/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite.py @@ -48,10 +48,12 @@ def daily_data_reg(): remove_extra_cols=False)["df"] df["dow_categorical"] = df["str_dow"].astype("category") df = df[[cst.TIME_COL, cst.VALUE_COL, "regressor1", "regressor2", "regressor_bool", "str_dow", "dow_categorical"]] - df = df.rename({ - cst.TIME_COL: "custom_time_col", - cst.VALUE_COL: "custom_value_col" - }, axis=1) + df = df.rename( + { + cst.TIME_COL: "custom_time_col", + cst.VALUE_COL: "custom_value_col"}, + axis=1) + return df @@ -442,7 +444,7 @@ def test_extra_pred_cols(hourly_data, daily_data_reg, weekly_data): 'Mid-Autumn Festival\'])', 
'C(Q(\'events_UnitedStates\'), levels=[\'\', "New Year\'s Day", \'Martin Luther King Jr. Day\', ' '"Washington\'s Birthday", \'Memorial Day\', \'Independence Day\', \'Independence Day (Observed)\', ' - '\'Labor Day\', \'Columbus Day\', \'Veterans Day\', \'Thanksgiving\', \'Christmas Day\', ' + '\'Labor Day\', \'Columbus Day\', \'Veterans Day\', \'Thanksgiving\', \'Christmas Day\', \'Halloween\', ' '\'Christmas Day (Observed)\', "New Year\'s Day (Observed)", \'Veterans Day (Observed)\', ' '\'Juneteenth National Independence Day\', \'Juneteenth National Independence Day (Observed)\'])'] @@ -573,6 +575,8 @@ def test_convert_simple_silverkite_params_hourly(hourly_data): fit_algorithm="ridge", fit_algorithm_params=None, daily_event_df_dict=expected_event, + daily_event_neighbor_impact=None, + daily_event_shifted_effect=None, changepoints_dict=None, fs_components_df=expected_fs, autoreg_dict=None, @@ -590,7 +594,8 @@ def test_convert_simple_silverkite_params_hourly(hourly_data): forecast_horizon=24, simulation_based=False, simulation_num=10, - fast_simulation=False + fast_simulation=False, + remove_intercept=False ) assert_equal(parameters, expected) @@ -793,6 +798,15 @@ def test_get_requested_seasonality_order(): default_order=5, is_enabled_auto=False) == 0 + assert silverkite._SimpleSilverkiteForecast__get_requested_seasonality_order( + requested_seasonality=None, + default_order=5, + is_enabled_auto=True) == 0 + assert silverkite._SimpleSilverkiteForecast__get_requested_seasonality_order( + requested_seasonality=None, + default_order=5, + is_enabled_auto=False) == 0 + assert silverkite._SimpleSilverkiteForecast__get_requested_seasonality_order( requested_seasonality=1, default_order=5, @@ -1431,11 +1445,38 @@ def test_auto_config_params(daily_data_reg): assert "ct1" in params["extra_pred_cols"] assert params["changepoints_dict"]["method"] == "custom" # Holidays is overridden by auto seasonality. 
- assert len(params["daily_event_df_dict"]) == 198 + assert len(params["daily_event_df_dict"]) == 203 assert "custom_event" in params["daily_event_df_dict"] assert "China_Chinese New Year" in params["daily_event_df_dict"] +def test_config_run_with_dst_features(daily_data_reg): + """Tests `extra_pred_cols` with daylight saving features along with auto + parameters + """ + df = daily_data_reg[["custom_time_col", "custom_value_col"]] + silverkite = SimpleSilverkiteForecast() + trained_model = silverkite.forecast_simple( + df=df, + time_col="custom_time_col", + value_col="custom_value_col", + forecast_horizon=1, + auto_holiday=True, + holidays_to_model_separately="auto", + holiday_lookup_countries="auto", + holiday_pre_num_days=2, + holiday_post_num_days=2, + extra_pred_cols=["eu_dst"], + auto_seasonality=True, + yearly_seasonality=0, + quarterly_seasonality="auto", + monthly_seasonality=False, + weekly_seasonality=True, + daily_seasonality=5) + + assert "eu_dst" in trained_model["pred_cols"] + + def test_auto_config_run(daily_data_reg): """Tests the auto options: @@ -1512,3 +1553,38 @@ def test_quantile_regression(): ) pred = silverkite.predict(df, trained_model=trained_model)["fut_df"] assert round(sum(pred[VALUE_COL] > df[VALUE_COL]) / len(pred), 1) == 0.9 + + +def test_holidays_with_neighbor_and_remove_intercept(): + """Tests holiday with additional neighbors and removing intercept.""" + dl = DataLoader() + df = dl.load_peyton_manning() + df[TIME_COL] = pd.to_datetime(df[TIME_COL]) + silverkite = SimpleSilverkiteForecast() + trained_model = silverkite.forecast_simple( + df=df, + time_col=TIME_COL, + value_col=VALUE_COL, + forecast_horizon=7, + holidays_to_model_separately=[], + holiday_lookup_countries=["US"], + holiday_pre_num_days=0, + holiday_post_num_days=0, + daily_event_shifted_effect=["7D"], + growth_term=None, + changepoints_dict=None, + yearly_seasonality=False, + quarterly_seasonality=False, + monthly_seasonality=False, + weekly_seasonality=False, +
daily_seasonality=False, + feature_sets_enabled=False, + fit_algorithm="ridge", + remove_intercept=True, + extra_pred_cols=["0"] + ) + assert trained_model["drop_intercept_col"] == "C(Q('events_Other'), levels=['', 'event'])[]" + assert trained_model["daily_event_shifted_effect"] == ["7D"] + assert "C(Q('events_Other'), levels=['', 'event'])[]" not in trained_model["x_mat"].columns + assert "C(Q('events_Other_7D_after'), levels=['', 'event_7D_after'])[T.event_7D_after]" in trained_model["x_mat"].columns + silverkite.predict(df, trained_model=trained_model)["fut_df"] diff --git a/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite_helper.py b/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite_helper.py index 5ea1484..df0b636 100644 --- a/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite_helper.py +++ b/greykite/tests/algo/forecast/silverkite/test_forecast_simple_silverkite_helper.py @@ -309,3 +309,21 @@ def test_get_event_pred_cols(): f"C(Q('events_k1'), levels=['{EVENT_DEFAULT}', 'level1', 'level2'])", f"C(Q('events_k2'), levels=['{EVENT_DEFAULT}', 'a', 'b', 'c'])", f"C(Q('events_k3'), levels=['{EVENT_DEFAULT}', 'd'])"] + + +def test_get_event_pred_cols_with_neighbor(): + """Tests `get_event_pred_cols` with neighbor events.""" + + daily_event_df_dict = { + "k1": pd.DataFrame({EVENT_DF_LABEL_COL: ["level1", "level2", "level1"]}), + "k2": pd.DataFrame({EVENT_DF_LABEL_COL: ["a", "b", "c"]}), + "k3": pd.DataFrame({EVENT_DF_LABEL_COL: ["d", "d", "d"]}), + } + assert get_event_pred_cols(daily_event_df_dict, ["7D"]) == [ + f"C(Q('events_k1'), levels=['{EVENT_DEFAULT}', 'level1', 'level2'])", + f"C(Q('events_k1_7D_after'), levels=['{EVENT_DEFAULT}', 'level1_7D_after', 'level2_7D_after'])", + f"C(Q('events_k2'), levels=['{EVENT_DEFAULT}', 'a', 'b', 'c'])", + f"C(Q('events_k2_7D_after'), levels=['{EVENT_DEFAULT}', 'a_7D_after', 'b_7D_after', 'c_7D_after'])", + f"C(Q('events_k3'), levels=['{EVENT_DEFAULT}', 
'd'])", + f"C(Q('events_k3_7D_after'), levels=['{EVENT_DEFAULT}', 'd_7D_after'])" + ] diff --git a/greykite/tests/algo/forecast/similarity/test_forecast_similarity_based.py b/greykite/tests/algo/forecast/similarity/test_forecast_similarity_based.py index 7fb586c..81ef5b7 100644 --- a/greykite/tests/algo/forecast/similarity/test_forecast_similarity_based.py +++ b/greykite/tests/algo/forecast/similarity/test_forecast_similarity_based.py @@ -1,14 +1,15 @@ import matplotlib # isort:skip -matplotlib.use("agg") # noqa: E402 -import matplotlib.pyplot as plt # isort:skip - from greykite.algo.forecast.similarity.forecast_similarity_based import forecast_similarity_based from greykite.common.evaluation import EvaluationMetricEnum from greykite.common.evaluation import calc_pred_err from greykite.common.testing_utils import generate_df_for_tests +matplotlib.use("agg") # noqa: E402 +import matplotlib.pyplot as plt # isort:skip # noqa: E402 + + def test_forecast_similarity_based(): """ Testing the function: forecast_similarity_based in various examples """ diff --git a/greykite/tests/algo/uncertainty/conditional/test_dataframe_utils.py b/greykite/tests/algo/uncertainty/conditional/test_dataframe_utils.py index ddfaa78..5ee242a 100644 --- a/greykite/tests/algo/uncertainty/conditional/test_dataframe_utils.py +++ b/greykite/tests/algo/uncertainty/conditional/test_dataframe_utils.py @@ -1,5 +1,5 @@ import pandas as pd -from pandas.util.testing import assert_frame_equal +from pandas.testing import assert_frame_equal from greykite.algo.uncertainty.conditional.dataframe_utils import limit_tuple_col from greykite.algo.uncertainty.conditional.dataframe_utils import offset_tuple_col diff --git a/greykite/tests/common/features/test_adjust_anomalous_data.py b/greykite/tests/common/features/test_adjust_anomalous_data.py index c925ea8..a86e960 100644 --- a/greykite/tests/common/features/test_adjust_anomalous_data.py +++ b/greykite/tests/common/features/test_adjust_anomalous_data.py @@ -8,6 
+8,7 @@ from greykite.common.constants import START_TIME_COL from greykite.common.constants import TIME_COL from greykite.common.features.adjust_anomalous_data import adjust_anomalous_data +from greykite.common.features.adjust_anomalous_data import label_anomalies_multi_metric from greykite.common.testing_utils import generate_anomalous_data from greykite.common.testing_utils import generic_test_adjust_anomalous_data @@ -22,7 +23,7 @@ def test_adjust_anomalous_data(data): df_raw = data["df"] anomaly_df = data["anomaly_df"] - # Adjusts for `"y"` for `"MOBILE"` dimension. + # Adjusts for `"y"` for `"level_1"` dimension. value_col = "y" adj_df_info = adjust_anomalous_data( df=df_raw, @@ -34,7 +35,7 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=ADJUSTMENT_DELTA_COL, filter_by_dict={ METRIC_COL: value_col, - "platform": "MOBILE"}, + "dimension1": "level_1"}, filter_by_value_col=None, adjustment_method="add") adj_values = pd.Series([np.nan, np.nan, 2., 6., 7., 8., 6., 7., 8., 9.]) @@ -43,7 +44,7 @@ def test_adjust_anomalous_data(data): adj_df_info=adj_df_info, adj_values=adj_values) - # Adjusts for `"y"` for `"MOBILE"` dimension and vertical `"sales"` or `"ads"`, + # Adjusts for `"y"` for `"level_1"` dimension and vertical `"level_2"` or `"level_1"`, # adjustment_method = `"subtract"` value_col = "y" adj_df_info = adjust_anomalous_data( @@ -55,8 +56,8 @@ def test_adjust_anomalous_data(data): end_time_col=END_TIME_COL, adjustment_delta_col=ADJUSTMENT_DELTA_COL, filter_by_dict={ - "platform": "MOBILE", - "vertical": ["sales", "ads"]}, + "dimension1": "level_1", + "dimension2": ["level_2", "level_1"]}, filter_by_value_col=METRIC_COL, adjustment_method="subtract") @@ -66,7 +67,7 @@ def test_adjust_anomalous_data(data): adj_df_info=adj_df_info, adj_values=adj_values) - # Adjusts for `"y"` for `"MOBILE"` dimension and vertical `"sales"`. + # Adjusts for `"y"` for `"level_1"` dimension and vertical `"level_2"`. 
value_col = "y" adj_df_info = adjust_anomalous_data( df=df_raw, @@ -78,8 +79,8 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=ADJUSTMENT_DELTA_COL, filter_by_dict={ METRIC_COL: value_col, - "platform": "MOBILE", - "vertical": "sales"}, + "dimension1": "level_1", + "dimension2": "level_2"}, filter_by_value_col=None) adj_values = pd.Series([0., 1., 2., 6., 7., 8., 6., 7., 8., 9.]) @@ -101,16 +102,16 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=ADJUSTMENT_DELTA_COL, filter_by_dict={ METRIC_COL: value_col, - "platform": "MOBILE"}, + "dimension1": "level_1"}, filter_by_value_col=None) - adj_values = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]) + adj_values = pd.Series([20., 21., 22., 23., 24., 25., 26., 27., 28., 29.]) generic_test_adjust_anomalous_data( value_col=value_col, adj_df_info=adj_df_info, adj_values=adj_values) - # Adjusts for `"z"` with DESKTOP dimension passed. + # Adjusts for `"z"` with level_2 dimension passed. # Since for `"z"` Desktop is impacted, changes are expected. value_col = "z" adj_df_info = adjust_anomalous_data( @@ -123,7 +124,7 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=ADJUSTMENT_DELTA_COL, filter_by_dict={ METRIC_COL: value_col, - "platform": "DESKTOP"}, + "dimension1": "level_2"}, filter_by_value_col=None) adj_values = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 22.0, 23.0, np.nan]) @@ -224,8 +225,8 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=None, filter_by_dict={ METRIC_COL: ("y", "z"), - "platform": {"MOBILE", "DESKTOP"}, - "vertical": ["ads", "sales"], + "dimension1": {"level_1", "level_2"}, + "dimension2": ["level_1", "level_2"], }, filter_by_value_col=None) @@ -239,8 +240,8 @@ def test_adjust_anomalous_data(data): # Successful date conversion for comparison. # The provided dates are strings in `anomaly_df_format`, datetime in `df_raw`. 
anomaly_df_format = anomaly_df.copy() - anomaly_df_format[START_TIME_COL] = ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018"] - anomaly_df_format[END_TIME_COL] = ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018"] + anomaly_df_format[START_TIME_COL] = ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018", "1/1/2099"] + anomaly_df_format[END_TIME_COL] = ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018", "1/1/2099"] value_col = "y" adj_df_info = adjust_anomalous_data( df=df_raw, @@ -252,8 +253,8 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=None, filter_by_dict={ METRIC_COL: ("y", "z"), - "platform": {"MOBILE", "DESKTOP"}, - "vertical": ["ads", "sales"], + "dimension1": {"level_1", "level_2"}, + "dimension2": ["level_1", "level_2"], }, filter_by_value_col=None) @@ -265,8 +266,8 @@ def test_adjust_anomalous_data(data): adj_values=adj_values) # Checks failure to convert date - anomaly_df_format[START_TIME_COL] = ["999/999/2018", "1/4/2018", "1/8/2018", "1/10/2018"] - anomaly_df_format[END_TIME_COL] = ["999/999/2019", "1/6/2018", "1/9/2018", "1/10/2018"] + anomaly_df_format[START_TIME_COL] = ["999/999/2018", "1/4/2018", "1/8/2018", "1/10/2018", "1/1/2099"] + anomaly_df_format[END_TIME_COL] = ["999/999/2019", "1/6/2018", "1/9/2018", "1/10/2018", "1/1/2099"] with pytest.warns( UserWarning, match=r"Dates could not be parsed by `pandas.to_datetime`, using string comparison " @@ -282,8 +283,8 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=None, filter_by_dict={ METRIC_COL: ("y", "z"), - "platform": {"MOBILE", "DESKTOP"}, - "vertical": ["ads", "sales"], + "dimension1": {"level_1", "level_2"}, + "dimension2": ["level_1", "level_2"], }, filter_by_value_col=None) @@ -334,3 +335,127 @@ def test_adjust_anomalous_data(data): adjustment_delta_col=None, filter_by_dict={"device": "iPhone"}, filter_by_value_col=None) + + +def test_label_anomalies_multi_metric(): + """Tests ``label_anomalies_multi_metric``""" + anomaly_df = pd.DataFrame({ + "start_time": 
["2020-01-01", "2020-02-01", "2020-01-02", "2020-02-02", "2020-02-05"], + "end_time": ["2020-01-03", "2020-02-04", "2020-01-05", "2020-02-06", "2020-02-08"], + "metric": ["impressions", "impressions", "clicks", "clicks", "bookings"] + }) + + ts = pd.date_range(start="2019-12-01", end="2020-03-01", freq="D") + + np.random.seed(1317) + df = pd.DataFrame({"ts": ts}) + size = len(df) + value_cols = ["impressions", "clicks", "bookings"] + df["impressions"] = np.random.normal(loc=0.0, scale=1.0, size=size) + df["clicks"] = np.random.normal(loc=1.0, scale=1.0, size=size) + df["bookings"] = np.random.normal(loc=2.0, scale=1.0, size=size) + + res = label_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=value_cols, + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col="start_time", + end_time_col="end_time") + + augmented_df = res["augmented_df"] + is_anomaly_cols = res["is_anomaly_cols"] + anomaly_value_cols = res["anomaly_value_cols"] + normal_value_cols = res["normal_value_cols"] + + assert len(augmented_df) == len(df) + + assert is_anomaly_cols == [ + "impressions_is_anomaly", + "clicks_is_anomaly", + "bookings_is_anomaly"] + + assert anomaly_value_cols == [ + "impressions_anomaly_value", + "clicks_anomaly_value", + "bookings_anomaly_value"] + + assert normal_value_cols == [ + "impressions_normal_value", + "clicks_normal_value", + "bookings_normal_value"] + + assert set(augmented_df.columns) == set( + ["ts"] + + value_cols + + is_anomaly_cols + + anomaly_value_cols + + normal_value_cols) + + # From the above data, it's clear that this is the range for bookings anomalies and non-anomalies + bookings_anomaly_time_ind = (df["ts"] >= "2020-02-05") & (df["ts"] <= "2020-02-08") + bookings_normal_time_ind = (df["ts"] < "2020-02-05") | (df["ts"] > "2020-02-08") + + # We expect this to only consist of missing values + x = augmented_df.loc[bookings_anomaly_time_ind]["bookings_normal_value"] + assert x.isnull().all() + + # We expect this to only
consist of missing values + x = augmented_df.loc[bookings_normal_time_ind, "bookings_anomaly_value"] + assert x.isnull().all() + + # Checks for anomaly values + x = augmented_df.loc[bookings_anomaly_time_ind, "bookings_anomaly_value"] + y = df.loc[bookings_anomaly_time_ind, "bookings"] + assert (x == y).all() + + # Checks for normal values + x = augmented_df.loc[bookings_normal_time_ind, "bookings_normal_value"] + y = df.loc[bookings_normal_time_ind, "bookings"] + assert (x == y).all() + + # Tests for raising ``ValueError`` due to non existing columns + expected_match = "time_col:" + with pytest.raises(ValueError, match=expected_match): + label_anomalies_multi_metric( + df=df, + time_col="timestamp", # non-existing column name + value_cols=value_cols, + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col="start_time", + end_time_col="end_time") + + expected_match = "value_col:" + with pytest.raises(ValueError, match=expected_match): + label_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=["abandons", "impressions"], # non-existing column name + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col="start_time", + end_time_col="end_time") + + expected_match = "start_time_col:" + with pytest.raises(ValueError, match=expected_match): + label_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=["impressions"], + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col="start_ts", # non-existing column name + end_time_col="end_time") + + expected_match = "end_time_col:" + with pytest.raises(ValueError, match=expected_match): + label_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=["impressions"], + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col="start_time", + end_time_col="end_ts") # non-existing column name diff --git a/greykite/tests/common/features/test_timeseries_features.py 
b/greykite/tests/common/features/test_timeseries_features.py index f782c1d..95e74ee 100644 --- a/greykite/tests/common/features/test_timeseries_features.py +++ b/greykite/tests/common/features/test_timeseries_features.py @@ -1,5 +1,6 @@ import datetime from datetime import datetime as dt +from datetime import timedelta import numpy as np import pandas as pd @@ -9,6 +10,9 @@ from greykite.common.constants import EVENT_DF_DATE_COL from greykite.common.constants import EVENT_DF_LABEL_COL from greykite.common.constants import EVENT_PREFIX +from greykite.common.constants import IS_EVENT_ADJACENT_COL +from greykite.common.constants import IS_EVENT_COL +from greykite.common.constants import IS_EVENT_EXACT_COL from greykite.common.constants import TIME_COL from greykite.common.features.timeseries_features import add_daily_events from greykite.common.features.timeseries_features import add_event_window @@ -28,12 +32,18 @@ from greykite.common.features.timeseries_features import get_changepoint_values_from_config from greykite.common.features.timeseries_features import get_custom_changepoints_values from greykite.common.features.timeseries_features import get_default_origin_for_time_vars +from greykite.common.features.timeseries_features import get_eu_dst_end +from greykite.common.features.timeseries_features import get_eu_dst_start from greykite.common.features.timeseries_features import get_evenly_spaced_changepoints_dates from greykite.common.features.timeseries_features import get_evenly_spaced_changepoints_values from greykite.common.features.timeseries_features import get_fourier_col_name from greykite.common.features.timeseries_features import get_holidays from greykite.common.features.timeseries_features import get_logistic_func +from greykite.common.features.timeseries_features import get_us_dst_end +from greykite.common.features.timeseries_features import get_us_dst_start +from greykite.common.features.timeseries_features import is_dst_fcn from 
greykite.common.features.timeseries_features import logistic +from greykite.common.features.timeseries_features import pytz_is_dst_fcn from greykite.common.features.timeseries_features import signed_pow from greykite.common.features.timeseries_features import signed_pow_fcn from greykite.common.features.timeseries_features import signed_sqrt @@ -48,14 +58,47 @@ def hourly_data(): return generate_df_for_tests(freq="H", periods=24 * 500) +# Below defines a function to do all tests for daylight saving functions +def validate_dst_times(is_dst_func, dst_start, dst_end): + """Assuming daylight saving starts on ``dst_start`` and ends on ``dst_end``, + it performs a series of assertions to ensure the function + ``is_dst_func`` works as expected.""" + # It must be daylight saving on the start date + dt = dst_start + assert is_dst_func([dt])[0] is True, "dst start not correct" + # It must be still daylight saving an hour after the start date + assert is_dst_func([dt + pd.to_timedelta("1H")])[0] is True, "an hour after start of dst" + # It must be still daylight saving 2 hours after the start date + assert is_dst_func([dt + pd.to_timedelta("2H")])[0] is True, "2 hours after start of dst" + # It must be still daylight saving 90 days after the start date + assert is_dst_func([dt + pd.to_timedelta("90D")])[0] is True, "90 days after start of dst" + # It is not daylight saving an hour before the start date + assert is_dst_func([dt - pd.to_timedelta("1H")])[0] is False, "an hour before start of dst" + + # It must be daylight saving right before the end date + dt = dst_end + assert is_dst_func([dt])[0] is True + # It must be still daylight saving an hour before the end date + assert is_dst_func([dt - pd.to_timedelta("1H")])[0] is True, "an hour before end of dst" + # It must be still daylight saving 2 hours before the end date + assert is_dst_func([dt - pd.to_timedelta("2H")])[0] is True, "2 hours before end of dst" + # It must be still daylight saving 90 days before the end date +
assert is_dst_func([dt - pd.to_timedelta("90D")])[0] is True, "90 days before end of dst" + # It is not daylight saving an hour after the end date + assert is_dst_func([dt + pd.to_timedelta("1H")])[0] is False, "an hour after end of dst" + + def test_convert_date_to_continuous_time(): assert convert_date_to_continuous_time(dt(2019, 1, 1)) == 2019.0 assert convert_date_to_continuous_time(dt(2019, 7, 1)) == 2019 + 181 / 365 assert convert_date_to_continuous_time(dt(2020, 7, 1)) == 2020 + 182 / 366 # leap year - assert convert_date_to_continuous_time(dt(2019, 7, 1, 7, 4, 24)) == 2019 + (181 # day - + 7 / 24 # hour - + 4 / (24 * 60) # minute - + 24 / (24 * 60 * 60)) / 365 # second + assert convert_date_to_continuous_time(dt(2019, 7, 1, 7, 4, 24)) == ( + 2019 + ( + 181 # day + + 7 / 24 # hour + + 4 / (24 * 60) # minute + + 24 / (24 * 60 * 60)) / 365 # second + ) def test_get_default_origin_for_time_vars(hourly_data): @@ -69,6 +112,104 @@ def test_get_default_origin_for_time_vars(hourly_data): assert round(conti_year_origin, 3) == 2018.496 +def test_pytz_is_dst_fcn(): + """Tests ``pytz_is_dst_fcn``""" + # Tests the function constructed for US + # Note that as long as the zone is inside mainland US, + # result should be the same + us_dst_start = pd.to_datetime("2022-03-13 02:00:00") + us_dst_end = pd.to_datetime("2022-11-06 01:59:59") + validate_dst_times( + is_dst_func=pytz_is_dst_fcn("US/Pacific"), + dst_start=us_dst_start, + dst_end=us_dst_end) + + # Repeats the US case, with another time zone + validate_dst_times( + is_dst_func=pytz_is_dst_fcn("US/Central"), + dst_start=us_dst_start, + dst_end=us_dst_end) + + # Europe case + eu_dst_start = pd.to_datetime("2022-03-27 01:00:00") + eu_dst_end = pd.to_datetime("2022-10-30 01:59:59") + validate_dst_times( + is_dst_func=pytz_is_dst_fcn("Europe/London"), + dst_start=eu_dst_start, + dst_end=eu_dst_end) + + +def test_is_dst_fcn(): + """Tests ``is_dst_fcn``""" + # Tests the function constructed for US + # Note that as long as
the zone is inside mainland US, + # result should be the same + us_dst_start = pd.to_datetime("2022-03-13 02:00:00") + us_dst_end = pd.to_datetime("2022-11-06 01:59:59") + validate_dst_times( + is_dst_func=is_dst_fcn("US/Pacific"), + dst_start=us_dst_start, + dst_end=us_dst_end) + + # Repeats the US case, with another time zone + validate_dst_times( + is_dst_func=is_dst_fcn("US/Central"), + dst_start=us_dst_start, + dst_end=us_dst_end) + + # Europe case + eu_dst_start = pd.to_datetime("2022-03-27 01:00:00") + eu_dst_end = pd.to_datetime("2022-10-30 01:59:59") + validate_dst_times( + is_dst_func=is_dst_fcn("Europe/London"), + dst_start=eu_dst_start, + dst_end=eu_dst_end) + + +def test_get_dst_start_end_date(): + """Tests `get_us_dst_start`, `get_us_dst_end`, + `get_eu_dst_start`, `get_eu_dst_end` functions. + """ + years = [2015, 2022, 2023, 2024] + expected_us_dst_start_dates = [ + dt(2015, 3, 8, 2, 0), + dt(2022, 3, 13, 2, 0), + dt(2023, 3, 12, 2, 0), + dt(2024, 3, 10, 2, 0) + ] + expected_us_dst_end_dates = [ + dt(2015, 11, 1, 2, 0), + dt(2022, 11, 6, 2, 0), + dt(2023, 11, 5, 2, 0), + dt(2024, 11, 3, 2, 0) + ] + expected_eu_dst_start_dates = [ + dt(2015, 3, 29, 1, 0), + dt(2022, 3, 27, 1, 0), + dt(2023, 3, 26, 1, 0), + dt(2024, 3, 31, 1, 0) + ] + expected_eu_dst_end_dates = [ + dt(2015, 10, 25, 2, 0), + dt(2022, 10, 30, 2, 0), + dt(2023, 10, 29, 2, 0), + dt(2024, 10, 27, 2, 0) + ] + us_dst_start_dates = [] + us_dst_end_dates = [] + eu_dst_start_dates = [] + eu_dst_end_dates = [] + for year in years: + us_dst_start_dates.append(get_us_dst_start(year)) + us_dst_end_dates.append(get_us_dst_end(year)) + eu_dst_start_dates.append(get_eu_dst_start(year)) + eu_dst_end_dates.append(get_eu_dst_end(year)) + assert us_dst_start_dates == expected_us_dst_start_dates + assert us_dst_end_dates == expected_us_dst_end_dates + assert eu_dst_start_dates == expected_eu_dst_start_dates + assert eu_dst_end_dates == expected_eu_dst_end_dates + + def test_build_time_features_df():
date_list = pd.date_range( start=dt(2019, 1, 1), @@ -121,8 +262,32 @@ def test_build_time_features_df(): assert time_df["dow_grouped"][24 * 4] == "6-Sat" assert time_df["dow_grouped"][24 * 5] == "7-Sun" # detailed check on dow_hr - assert list(time_df["dow_hr"])[::7][:25] == ['2_00', '2_07', '2_14', '2_21', '3_04', '3_11', '3_18', '4_01', '4_08', '4_15', '4_22', '5_05', '5_12', '5_19', - '6_02', '6_09', '6_16', '6_23', '7_06', '7_13', '7_20', '1_03', '1_10', '1_17', '2_00'] # noqa: E501 + assert list(time_df["dow_hr"])[::7][:25] == [ + "2_00", + "2_07", + "2_14", + "2_21", + "3_04", + "3_11", + "3_18", + "4_01", + "4_08", + "4_15", + "4_22", + "5_05", + "5_12", + "5_19", + "6_02", + "6_09", + "6_16", + "6_23", + "7_06", + "7_13", + "7_20", + "1_03", + "1_10", + "1_17", + "2_00"] # noqa: E501 assert time_df["ct1"][0] == 0.0 assert time_df["ct2"][0] == 0.0 @@ -184,6 +349,22 @@ def test_build_time_features_df(): build_time_features_df(dt=df0.iloc[0:0]["ts"], conti_year_origin=2019) +def test_build_time_features_df_with_dst(): + date_list = pd.date_range( + start=dt(2022, 3, 13), + periods=5, + freq="H").tolist() + + df0 = pd.DataFrame({"ts": date_list}) + time_df = build_time_features_df( + dt=df0["ts"], + conti_year_origin=2022, + add_dst_info=True) + + assert (time_df["us_dst"] == [False, False, True, True, True]).all() + assert (time_df["eu_dst"] == [False]*5).all() + + def test_build_time_features_df_leap_years(): date_list_non_leap_year = pd.date_range( start=dt(2019, 2, 28), @@ -318,6 +499,7 @@ def test_get_available_holidays_in_countries(): "Christmas Day", "Christmas Day (Observed)", "Columbus Day", + "Halloween", "Independence Day", "Independence Day (Observed)", "Juneteenth National Independence Day", @@ -345,6 +527,7 @@ def test_get_available_holidays_across_countries(): "Christmas Day (Observed)", "Columbus Day", "Dragon Boat Festival", + "Halloween", "Independence Day", "Independence Day (Observed)", "Juneteenth National Independence Day", @@ -365,12 
+548,13 @@ def test_get_available_holidays_across_countries(): def test_add_daily_events(): - # generate events dictionary + """Tests ``add_daily_events`` function.""" + # Generates events dictionary countries = ["US", "India", "UK"] event_df_dict = get_holidays(countries, year_start=2015, year_end=2025) original_col_names = [event_df_dict[country].columns[1] for country in countries] - # generate temporal data + # Generates temporal data date_list = pd.date_range( start=dt(2019, 1, 1), periods=100, @@ -383,10 +567,81 @@ def test_add_daily_events(): assert df_with_events[f"{EVENT_PREFIX}_India"].values[0] == "New Year's Day" assert df_with_events[f"{EVENT_PREFIX}_US"].values[25] == "" - # makes sure the function does not modify the input + # Makes sure the function does not modify the input new_col_names = [event_df_dict[country].columns[1] for country in countries] assert original_col_names == new_col_names + # Tests event indicators are correctly included + expected_new_cols = [f"events_{country}" for country in countries] + \ + [IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL, IS_EVENT_COL] # 6 columns should be added + assert set(df_with_events.columns).difference(df.columns) == set(expected_new_cols) + num_is_event = df_with_events[IS_EVENT_COL].sum() + num_is_event_exact = df_with_events[IS_EVENT_EXACT_COL].sum() + num_is_event_adjacent = df_with_events[IS_EVENT_ADJACENT_COL].sum() + assert num_is_event == num_is_event_exact + assert num_is_event_exact > 0 + assert num_is_event_adjacent == 0 + assert num_is_event == num_is_event_exact + num_is_event_adjacent + + +def test_add_daily_events_with_neighbor_impact(): + """Tests adding daily events with neighbor impact.""" + # Tests weekly data. 
+ df = pd.DataFrame({ + "date": pd.date_range("2020-01-01", freq="W-SUN", periods=100), + "y": 0 + }) + countries = ["US"] + event_df_dict = get_holidays(countries, year_start=2015, year_end=2025) + new_df = add_daily_events( + df, + event_df_dict, + neighbor_impact=lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i) for i in range(7)] + ) + # Checks holidays are mapped to the correct weekly dates. + assert new_df.iloc[0].tolist() == [pd.Timestamp("2020-01-05"), 0, "New Year's Day", 1, 0, 1] + assert new_df.iloc[-1].tolist() == [pd.Timestamp("2021-11-28"), 0, "Thanksgiving", 1, 0, 1] + + # Tests daily data, assuming rolling 7 day. + df = pd.DataFrame({ + "date": pd.date_range("2020-01-01", freq="D", periods=500), + "y": 0 + }) + countries = ["US"] + event_df_dict = get_holidays(countries, year_start=2015, year_end=2025) + new_df = add_daily_events( + df, + event_df_dict, + neighbor_impact=7 + ) + # Checks holidays are mapped to the correct weekly dates. + assert new_df.iloc[0].tolist() == [pd.Timestamp("2020-01-01"), 0, "Christmas Day", 1, 0, 1] + assert new_df.iloc[1].tolist() == [pd.Timestamp("2020-01-01"), 0, "New Year's Day", 1, 0, 1] + assert new_df.iloc[2].tolist() == [pd.Timestamp("2020-01-02"), 0, "New Year's Day", 1, 0, 1] + + +def test_add_daily_event_shifted_effect(): + """Tests adding additional neighbor events. + The additional events are added as extra columns rather than extra dates + under the same columns as in ``neighbor_effect``. 
+ """ + df = pd.DataFrame({ + "date": pd.date_range("2020-01-01", freq="W-SUN", periods=100), + "y": 0 + }) + countries = ["US"] + event_df_dict = get_holidays(countries, year_start=2015, year_end=2025) + new_df = add_daily_events( + df, + event_df_dict, + neighbor_impact=lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i) for i in range(7)], + shifted_effect=["-7D", "7D"] + ) + assert new_df.iloc[0].tolist() == [pd.Timestamp("2020-01-05"), 0, "New Year's Day", "", "Christmas Day_7D_after", 1, 0, 1] + assert new_df.iloc[1].tolist() == [pd.Timestamp("2020-01-12"), 0, "", "", "New Year's Day_7D_after", 1, 0, 1] + assert new_df.iloc[2].tolist() == [pd.Timestamp("2020-01-19"), 0, "", "Martin Luther King Jr. Day_7D_before", "", 1, 0, 1] + assert new_df.iloc[-1].tolist() == [pd.Timestamp("2021-11-28"), 0, "Thanksgiving", "", "", 1, 0, 1] + def test_get_evenly_spaced_changepoints(): df = pd.DataFrame({"time_col": np.arange(1, 11)}) @@ -521,7 +776,10 @@ def test_get_evenly_spaced_changepoint_values(): df0 = pd.DataFrame({"ts": date_list}) df = add_time_features_df(df0, time_col="ts", conti_year_origin=2018) - changepoints = get_evenly_spaced_changepoints_values(df, "ct1", n_changepoints=n_changepoints) + changepoints = get_evenly_spaced_changepoints_values( + df, + "ct1", + n_changepoints=n_changepoints) changepoint_dates = get_changepoint_dates_from_changepoints_dict( changepoints_dict={ "method": "uniform", @@ -553,7 +811,8 @@ def test_get_custom_changepoints(): df = add_time_features_df(df0, time_col="custom_time_col", conti_year_origin=2018) # dates as datetime - changepoint_dates = pd.to_datetime(["2018-01-01", "2019-01-02-16", "2019-01-03", "2019-02-01"]) + changepoint_dates = pd.to_datetime( + ["2018-01-01", "2019-01-02-16", "2019-01-03", "2019-02-01"]) result = get_custom_changepoints_values( df=df, changepoint_dates=changepoint_dates, @@ -615,7 +874,8 @@ def test_get_custom_changepoints(): df = add_time_features_df(df0, 
time_col="custom_time_col", conti_year_origin=2018) # dates as datetime - changepoint_dates = pd.to_datetime(["2018-01-01", "2019-01-02-16", "2019-01-03", "2019-02-01"]) + changepoint_dates = pd.to_datetime( + ["2018-01-01", "2019-01-02-16", "2019-01-03", "2019-02-01"]) result = get_custom_changepoints_values( df=df, changepoint_dates=changepoint_dates, @@ -624,7 +884,8 @@ def test_get_custom_changepoints(): ) # 2018-01-01 is mapped to 2019-01-01-00. Mapped to -00 if no hour provided # Last requested changepoint is not found - assert np.all(result == pd.to_datetime(["2019-01-01-00", "2019-01-02-16", "2019-01-03-00"])) + assert np.all(result == pd.to_datetime( + ["2019-01-01-00", "2019-01-02-16", "2019-01-03-00"])) def test_get_changepoint_values_from_config(hourly_data): @@ -850,7 +1111,10 @@ def test_get_changepoint_dates_from_changepoints_dict(): def test_add_event_window(): """Tests add_event_window""" # generate events data - event_df_dict = get_holidays(countries=["US", "India", "UK"], year_start=2018, year_end=2019) + event_df_dict = get_holidays( + countries=["US", "India", "UK"], + year_start=2018, + year_end=2019) df = event_df_dict["US"] shifted_event_dict = add_event_window( @@ -884,7 +1148,10 @@ def test_add_event_window(): def test_add_event_window_multi(): # generating events data - event_df_dict = get_holidays(countries=["US", "India", "UK"], year_start=2018, year_end=2019) + event_df_dict = get_holidays( + countries=["US", "India", "UK"], + year_start=2018, + year_end=2019) shifted_event_dict = add_event_window_multi( event_df_dict=event_df_dict, diff --git a/greykite/tests/common/features/test_timeseries_impute.py b/greykite/tests/common/features/test_timeseries_impute.py index 5f42b09..fd8d784 100644 --- a/greykite/tests/common/features/test_timeseries_impute.py +++ b/greykite/tests/common/features/test_timeseries_impute.py @@ -1,9 +1,9 @@ import numpy as np import pandas as pd -from pandas.util.testing import assert_equal from 
greykite.common.features.timeseries_impute import impute_with_lags from greykite.common.features.timeseries_impute import impute_with_lags_multi +from greykite.common.python_utils import assert_equal def test_impute_with_lags(): diff --git a/greykite/tests/common/test_aggregation_function_enum.py b/greykite/tests/common/test_aggregation_function_enum.py index 67b939e..3928599 100644 --- a/greykite/tests/common/test_aggregation_function_enum.py +++ b/greykite/tests/common/test_aggregation_function_enum.py @@ -11,4 +11,5 @@ def test_aggregate_function_enum(): assert AggregationFunctionEnum.nanmean.value(array) == 3 assert AggregationFunctionEnum.maximum.value(array) == 6 assert AggregationFunctionEnum.minimum.value(array) == 1 + assert AggregationFunctionEnum.sum.value(array) == 9 assert AggregationFunctionEnum.weighted_average.value(array) == 3 diff --git a/greykite/tests/common/test_data_loader.py b/greykite/tests/common/test_data_loader.py index 5c8ff02..d4842f4 100644 --- a/greykite/tests/common/test_data_loader.py +++ b/greykite/tests/common/test_data_loader.py @@ -54,7 +54,7 @@ def test_get_data_names(): def test_get_aggregated_data(): dl = DataLoader() test_df = pd.DataFrame({ - TIME_COL: pd.date_range("2020-01-01 00:00", "2020-12-31 23:00", freq="1H"), + TIME_COL: pd.date_range("2020-01-01 00:00", "2020-12-31 23:30", freq="0.5H"), "col1": 1, "col2": 2, "col3": 3, @@ -65,10 +65,18 @@ def test_get_aggregated_data(): # For each frequency, # (1) make sure the `TIME_COL` column is correctly included # (2) verify the aggregation part works correctly + # Hourly aggregation + df = dl.get_aggregated_data(test_df, agg_freq="hourly", agg_func=agg_func) + assert df.shape == (366*24, len(agg_func) + 1) + assert (df["col1"] != 1*2).sum() == 0 + assert (df["col2"] != 2).sum() == 0 + assert (df["col3"] != 3).sum() == 0 + assert (df["col4"] != 4).sum() == 0 + assert (df["col5"] != 5).sum() == 0 # Daily aggregation df = dl.get_aggregated_data(test_df, agg_freq="daily", 
agg_func=agg_func) assert df.shape == (366, len(agg_func) + 1) - assert (df["col1"] != 24).sum() == 0 + assert (df["col1"] != 24*2).sum() == 0 assert (df["col2"] != 2).sum() == 0 assert (df["col3"] != 3).sum() == 0 assert (df["col4"] != 4).sum() == 0 @@ -76,7 +84,7 @@ def test_get_aggregated_data(): # Weekly aggregation df = dl.get_aggregated_data(test_df, agg_freq="weekly", agg_func=agg_func) assert df.shape == (53, len(agg_func) + 1) - assert (df["col1"] != 24*7).sum() == 2 + assert (df["col1"] != 24*7*2).sum() == 2 assert (df["col2"] != 2).sum() == 0 assert (df["col3"] != 3).sum() == 0 assert (df["col4"] != 4).sum() == 0 @@ -84,7 +92,7 @@ def test_get_aggregated_data(): # Monthly aggregation df = dl.get_aggregated_data(test_df, agg_freq="monthly", agg_func=agg_func) assert df.shape == (12, len(agg_func) + 1) - assert (df["col1"].isin([24*29, 24*30, 24*31])).sum() == 12 + assert (df["col1"].isin([24*29*2, 24*30*2, 24*31*2])).sum() == 12 assert (df["col2"] != 2).sum() == 0 assert (df["col3"] != 3).sum() == 0 assert (df["col4"] != 4).sum() == 0 diff --git a/greykite/tests/common/test_evaluation.py b/greykite/tests/common/test_evaluation.py index 52add4e..a441870 100644 --- a/greykite/tests/common/test_evaluation.py +++ b/greykite/tests/common/test_evaluation.py @@ -35,6 +35,7 @@ from greykite.common.evaluation import fraction_outside_tolerance from greykite.common.evaluation import fraction_within_bands from greykite.common.evaluation import mean_absolute_percent_error +from greykite.common.evaluation import mean_interval_score from greykite.common.evaluation import median_absolute_percent_error from greykite.common.evaluation import prediction_band_width from greykite.common.evaluation import quantile_loss @@ -551,6 +552,90 @@ def test_fraction_within_bands(): assert "1 of 4 upper bound values are smaller than the lower bound" in record[0].message.args[0] +def test_mean_interval_score(): + """Tests `mean_interval_score` function.""" + + # Checks that the mean 
interval score is the same regardless of coverage if lower < observed < upper. + mean_interval_scores = [mean_interval_score( + lower=[0., 0., 0., 0.], + observed=[2., 2., 2., 2.], + upper=[5., 5., 5., 5.], + coverage=coverage) for coverage in np.linspace(0, 1, num=21)] + assert all([score == 5.0 for score in mean_interval_scores]) + + # Checks that the mean interval score increases as a function of coverage if observed is not between lower and upper. + mean_interval_scores = [mean_interval_score( + lower=[0.], + observed=[-2.], + upper=[5.], + coverage=coverage) for coverage in np.linspace(0, 1, num=21)] + assert all(i < j for i, j in zip(mean_interval_scores, mean_interval_scores[1:])) + + # Checks that the mean interval score equals infinity if any of lower or upper is infinity. + assert mean_interval_score( + lower=[-np.Inf, 0., 0., 0.], + observed=[2., 2., 2., 2.], + upper=[5., 5., 5., 5.], + coverage=0.95 + ) == float("inf") + + # Checks that we get expected value when observed is less than lower. + lower = 0.0 + upper = 5.0 + observed = -2.0 + coverage = 0.95 + expected = upper - lower + 2.0 * (lower - observed) / (1.0 - coverage) + assert mean_interval_score( + lower=[lower], + observed=[observed], + upper=[upper], + coverage=coverage + ) == pytest.approx(expected) + + # Checks that we get expected value when observed is greater than upper. + lower = 0.0 + upper = 5.0 + observed = 7.0 + coverage = 0.95 + expected = upper - lower + 2.0 * (observed - upper) / (1.0 - coverage) + assert mean_interval_score( + lower=[lower], + observed=[observed], + upper=[upper], + coverage=coverage + ) == pytest.approx(expected) + + with pytest.warns(UserWarning) as record: + # Checks that the mean interval score returns the correct value even if there are `np.nan` in the observed. 
+ assert mean_interval_score( + lower=[0., 0., 0., 0.], + observed=[2., np.nan, 2., 2.], + upper=[5., 5., 5., 5.], + coverage=0.95 + ) == 5.0 + assert "1 value(s) in y_true were NA or infinite and are omitted in error calc" in record[0].message.args[0] + + with pytest.warns(UserWarning) as record: + # Checks that the mean interval score returns the correct value even if there are `np.Inf` in the observed. + assert mean_interval_score( + lower=[0., 0., 0., 0.], + observed=[2., np.Inf, 2., 2.], + upper=[5., 5., 5., 5.], + coverage=0.95 + ) == 5.0 + assert "1 value(s) in y_true were NA or infinite and are omitted in error calc" in record[0].message.args[0] + + with pytest.warns(UserWarning) as record: + # Checks that the mean interval score returns the correct value even if there are `np.nan` in lower or upper. + assert mean_interval_score( + lower=[0., 0., np.nan, 0.], + observed=[2., 2., 2., 2.], + upper=[5., 5., 5., 5.], + coverage=0.95 + ) == 5.0 + assert "1 value(s) in lower/upper bounds were NA and are omitted in error calc" in record[0].message.args[0] + + def test_prediction_band_width(): """Tests prediction_band_width function""" assert prediction_band_width( @@ -781,13 +866,24 @@ def test_validation_metric_enum(): """Tests ValidationMetricEnum accessors""" assert ValidationMetricEnum.BAND_WIDTH == ValidationMetricEnum["BAND_WIDTH"] assert ValidationMetricEnum.BAND_WIDTH.name == "BAND_WIDTH" - assert ValidationMetricEnum.BAND_WIDTH.value == (prediction_band_width, False) + assert ValidationMetricEnum.BAND_WIDTH.value == (prediction_band_width, False, "band_width") # tuple access works as usual assert ValidationMetricEnum["BAND_WIDTH"].value[0] == prediction_band_width assert ValidationMetricEnum.BAND_WIDTH.get_metric_func() == prediction_band_width + assert ValidationMetricEnum.BAND_WIDTH.get_metric_name() == "band_width" assert not ValidationMetricEnum["BAND_WIDTH"].value[1] assert not ValidationMetricEnum.BAND_WIDTH.get_metric_greater_is_better() assert 
len(", ".join(ValidationMetricEnum.__dict__["_member_names_"])) > 0 + assert ValidationMetricEnum.BAND_COVERAGE == ValidationMetricEnum["BAND_COVERAGE"] assert ValidationMetricEnum.BAND_COVERAGE.name == "BAND_COVERAGE" + assert ValidationMetricEnum.BAND_COVERAGE.value == (fraction_within_bands, True, "band_coverage") + assert ValidationMetricEnum.BAND_COVERAGE.get_metric_func() == fraction_within_bands + assert ValidationMetricEnum.BAND_COVERAGE.get_metric_name() == "band_coverage" + + assert ValidationMetricEnum.MEAN_INTERVAL_SCORE == ValidationMetricEnum["MEAN_INTERVAL_SCORE"] + assert ValidationMetricEnum.MEAN_INTERVAL_SCORE.name == "MEAN_INTERVAL_SCORE" + assert ValidationMetricEnum.MEAN_INTERVAL_SCORE.value == (mean_interval_score, False, "MIS") + assert ValidationMetricEnum.MEAN_INTERVAL_SCORE.get_metric_func() == mean_interval_score + assert ValidationMetricEnum.MEAN_INTERVAL_SCORE.get_metric_name() == "MIS" diff --git a/greykite/tests/framework/benchmark/test_gen_moving_timeseries_forecast.py b/greykite/tests/common/test_gen_moving_timeseries_forecast.py similarity index 97% rename from greykite/tests/framework/benchmark/test_gen_moving_timeseries_forecast.py rename to greykite/tests/common/test_gen_moving_timeseries_forecast.py index 5a57179..f5ba1ca 100644 --- a/greykite/tests/framework/benchmark/test_gen_moving_timeseries_forecast.py +++ b/greykite/tests/common/test_gen_moving_timeseries_forecast.py @@ -2,8 +2,8 @@ import pandas as pd import pytest +from greykite.common.gen_moving_timeseries_forecast import gen_moving_timeseries_forecast from greykite.common.testing_utils import generate_df_for_tests -from greykite.framework.benchmark.gen_moving_timeseries_forecast import gen_moving_timeseries_forecast def test_gen_moving_timeseries_forecast(): diff --git a/greykite/tests/common/test_python_utils.py b/greykite/tests/common/test_python_utils.py index 6ffc8d9..759ad96 100644 --- a/greykite/tests/common/test_python_utils.py +++ 
b/greykite/tests/common/test_python_utils.py @@ -20,6 +20,7 @@ from greykite.common.python_utils import ignore_warnings from greykite.common.python_utils import mutable_field from greykite.common.python_utils import reorder_columns +from greykite.common.python_utils import split_offset_str from greykite.common.python_utils import unique_dict_in_list from greykite.common.python_utils import unique_elements_in_list from greykite.common.python_utils import unique_in_list @@ -982,3 +983,11 @@ def test_group_strs_with_regex_patterns(): assert result == { "str_groups": [["sd2"], ["sd1", "sd22"]], "remainder": ["sd", "rr", "urr", "uu", "11", "12"]} + + +def test_split_offset_str(): + """Tests splitting offset strings.""" + assert split_offset_str("7D") == ["7", "D"] + assert split_offset_str("2H") == ["2", "H"] + assert split_offset_str("+1D") == ["+1", "D"] + assert split_offset_str("-500D") == ["-500", "D"] diff --git a/greykite/tests/common/test_testing_utils.py b/greykite/tests/common/test_testing_utils.py index 2996d88..179cab9 100644 --- a/greykite/tests/common/test_testing_utils.py +++ b/greykite/tests/common/test_testing_utils.py @@ -33,7 +33,7 @@ def test_generate_df_for_tests(): train_frac=0.9, remove_extra_cols=False) - assert data["df"].shape == (24*10, 52) # Contains time_feature columns + assert data["df"].shape == (24*10, 54) # Contains time_feature columns assert not data["train_df"].isna().any().any() assert not data["test_df"][TIME_COL].isna().any().any() diff --git a/greykite/tests/common/test_time_properties.py b/greykite/tests/common/test_time_properties.py index 68208a2..32d4429 100644 --- a/greykite/tests/common/test_time_properties.py +++ b/greykite/tests/common/test_time_properties.py @@ -18,6 +18,7 @@ from greykite.common.time_properties import fill_missing_dates from greykite.common.time_properties import find_missing_dates from greykite.common.time_properties import get_canonical_data +from greykite.common.time_properties import infer_freq from 
greykite.common.time_properties import min_gap_in_seconds @@ -366,6 +367,38 @@ def test_gcd_load_data_anomaly(): assert_equal(canonical_data_dict3["df_before_adjustment"], canonical_data_dict["df"]) assert_equal(canonical_data_dict3["df"], expected_df) + # Checks that when `anomaly_df` sets the last few datapoints in `df` to NA, + # `train_end_date` will not be affected. + value_col = "pm" + anomaly_df = pd.DataFrame({ + START_TIME_COL: ["2014-12-31-22"], + END_TIME_COL: ["2015-01-01-22"], + ADJUSTMENT_DELTA_COL: [np.nan] + }) + # Adjusts one column (value_col) + anomaly_info = { + "value_col": value_col, + "anomaly_df": anomaly_df, + "start_time_col": START_TIME_COL, + "end_time_col": END_TIME_COL, + "adjustment_delta_col": ADJUSTMENT_DELTA_COL, + "adjustment_method": "add" + } + + canonical_data_dict4 = get_canonical_data( + df=df, + time_col=TIME_COL, + value_col=value_col, + anomaly_info=anomaly_info) + + # Checks `train_end_date`. + assert canonical_data_dict4["train_end_date"] == df[TIME_COL].max() + + # Checks if the adjustment was done correctly. + values_after_adjustment = canonical_data_dict4["df"][VALUE_COL].tail(3).values + expected_after_adjustment = np.array([10, np.nan, np.nan]) + assert_equal(values_after_adjustment, expected_after_adjustment) + def test_gcd_train_end_date(): """Tests train_end_date for data without regressors""" @@ -382,31 +415,23 @@ def test_gcd_train_end_date(): with pytest.warns(UserWarning) as record: train_end_date = datetime.datetime(2018, 1, 1, 5, 0, 0) get_canonical_data(df=df) - assert f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. 
" \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({train_end_date})." in record[0].message.args[0] # train_end_date later than last date in df with pytest.warns(UserWarning) as record: train_end_date = datetime.datetime(2018, 1, 1, 8, 0, 0) - result_train_end_date = datetime.datetime(2018, 1, 1, 5, 0, 0) get_canonical_data(df, train_end_date=train_end_date) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({result_train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains null values at the end, " \ + f"or the input `train_end_date` is beyond the last timestamp available." in record[0].message.args[0] # train_end_date in between last date in df and last date before with pytest.warns(UserWarning) as record: train_end_date = datetime.datetime(2018, 1, 1, 6, 0, 0) - result_train_end_date = datetime.datetime(2018, 1, 1, 5, 0, 0) get_canonical_data(df, train_end_date=train_end_date) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({result_train_end_date})." 
in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains trailing NAs" in record[0].message.args[0] # train end date equal to last date before null canonical_data_dict = get_canonical_data(df, train_end_date=datetime.datetime(2018, 1, 1, 5, 0, 0)) @@ -448,9 +473,10 @@ def test_gcd_train_end_date_regressor(): df=df, train_end_date=None, regressor_cols=None) - assert f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({result_train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({result_train_end_date})." in record[0].message.args[0] assert canonical_data_dict["df"].shape == df.shape assert canonical_data_dict["fit_df"].shape == (22, 2) assert canonical_data_dict["regressor_cols"] == [] @@ -466,15 +492,12 @@ def test_gcd_train_end_date_regressor(): df=df, train_end_date=train_end_date, regressor_cols=regressor_cols) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({result_train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains null values at the end, " \ + f"or the input `train_end_date` is beyond the last timestamp available." 
in record[0].message.args[0] assert canonical_data_dict["fit_df"].shape == (22, 5) assert canonical_data_dict["regressor_cols"] == regressor_cols assert canonical_data_dict["fit_cols"] == [TIME_COL, VALUE_COL] + regressor_cols - assert canonical_data_dict["train_end_date"] == result_train_end_date + assert canonical_data_dict["train_end_date"] == datetime.datetime(2018, 1, 22) assert canonical_data_dict["last_date_for_val"] == datetime.datetime(2018, 1, 22) assert canonical_data_dict["last_date_for_reg"] == datetime.datetime(2018, 1, 28) @@ -486,15 +509,11 @@ def test_gcd_train_end_date_regressor(): df=df, train_end_date=train_end_date, regressor_cols=None) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({result_train_end_date})." in record[0].message.args[0] - assert canonical_data_dict["fit_df"].shape == (22, 2) + assert f"{VALUE_COL} column of the provided time series contains trailing NAs" in record[0].message.args[0] + assert canonical_data_dict["fit_df"].shape == (25, 2) assert canonical_data_dict["regressor_cols"] == [] assert canonical_data_dict["fit_cols"] == [TIME_COL, VALUE_COL] - assert canonical_data_dict["train_end_date"] == datetime.datetime(2018, 1, 22) + assert canonical_data_dict["train_end_date"] == train_end_date # `train_end_date` is unchanged when provided. 
assert canonical_data_dict["last_date_for_val"] == datetime.datetime(2018, 1, 22) assert canonical_data_dict["last_date_for_reg"] is None @@ -530,3 +549,37 @@ def test_gcd_train_end_date_regressor(): assert canonical_data_dict["last_date_for_reg"] == datetime.datetime(2018, 1, 28) assert (f"The following columns are not available to use as " f"regressors: ['regressor4', 'regressor5']") in record[0].message.args[0] + + +def test_infer_freq(): + freq_list = ["B", "W", "W-SAT", "W-TUE", "M", + "MS", "SM", "SMS", "CBMS", "BM", "B", "Q", + "QS", "BQS", "BQ-NOV", "Y", "A-JAN", "YS", + "AS-SEP", "BH", "T", "S"] + + for freq in freq_list: + date_list = pd.date_range( + start="2020-06-10", + periods=100, + freq=freq).tolist() + df = pd.DataFrame({TIME_COL: date_list}) + + # With complete timestamps + pandas_freq_complete = pd.infer_freq(df[TIME_COL]) + inferred_freq_complete = infer_freq(df, TIME_COL) + + # With NaN timestamps + df[TIME_COL][5:8] = np.nan + df[TIME_COL][32:40] = None + pandas_freq_nan = pd.infer_freq(df[TIME_COL]) + inferred_freq_nan = infer_freq(df, TIME_COL) + + # With missing timestamps + df = df[df[TIME_COL].notna()] + pandas_freq_missing = pd.infer_freq(df[TIME_COL]) + inferred_freq_missing = infer_freq(df, TIME_COL) + + print(freq, pandas_freq_complete, inferred_freq_complete, inferred_freq_nan) + + assert pandas_freq_nan == pandas_freq_missing is None + assert pandas_freq_complete == inferred_freq_complete == inferred_freq_missing == inferred_freq_complete diff --git a/greykite/tests/common/viz/test_colors_utils.py b/greykite/tests/common/viz/test_colors_utils.py index 136ecc8..824e5eb 100644 --- a/greykite/tests/common/viz/test_colors_utils.py +++ b/greykite/tests/common/viz/test_colors_utils.py @@ -1,7 +1,9 @@ +import pytest from plotly.colors import DEFAULT_PLOTLY_COLORS from greykite.common.testing_utils import assert_equal from greykite.common.viz.colors_utils import get_color_palette +from greykite.common.viz.colors_utils import 
get_distinct_colors def test_get_color_palette(): @@ -23,3 +25,92 @@ def test_get_color_palette(): colors = ["rgb(99, 114, 218)", "rgb(0, 145, 202)", "rgb(255, 255, 255)"] color_palette = get_color_palette(3, colors=colors) assert_equal(color_palette, colors) + + +def test_get_distinct_colors(): + """Tests the function to get most distinguishable colors.""" + # Under 10 colors, using tab10. + assert get_distinct_colors(num_colors=1) == get_distinct_colors(num_colors=2)[:1] + + assert get_distinct_colors(num_colors=10) == [ + "rgba(31, 119, 180, 0.95)", + "rgba(255, 127, 14, 0.95)", + "rgba(44, 160, 44, 0.95)", + "rgba(214, 39, 40, 0.95)", + "rgba(148, 103, 189, 0.95)", + "rgba(140, 86, 75, 0.95)", + "rgba(227, 119, 194, 0.95)", + "rgba(127, 127, 127, 0.95)", + "rgba(188, 189, 34, 0.95)", + "rgba(23, 190, 207, 0.95)" + ] + # Under 20 colors, using tab20. + assert get_distinct_colors(num_colors=15) == get_distinct_colors(num_colors=18)[:15] + + assert get_distinct_colors(num_colors=20, opacity=0.9) == [ + "rgba(31, 119, 180, 0.9)", + "rgba(174, 199, 232, 0.9)", + "rgba(255, 127, 14, 0.9)", + "rgba(255, 187, 120, 0.9)", + "rgba(44, 160, 44, 0.9)", + "rgba(152, 223, 138, 0.9)", + "rgba(214, 39, 40, 0.9)", + "rgba(255, 152, 150, 0.9)", + "rgba(148, 103, 189, 0.9)", + "rgba(197, 176, 213, 0.9)", + "rgba(140, 86, 75, 0.9)", + "rgba(196, 156, 148, 0.9)", + "rgba(227, 119, 194, 0.9)", + "rgba(247, 182, 210, 0.9)", + "rgba(127, 127, 127, 0.9)", + "rgba(199, 199, 199, 0.9)", + "rgba(188, 189, 34, 0.9)", + "rgba(219, 219, 141, 0.9)", + "rgba(23, 190, 207, 0.9)", + "rgba(158, 218, 229, 0.9)" + ] + + # Under 256 colors, using Viridis. 
+ assert get_distinct_colors(num_colors=30, opacity=0.85) == [ + "rgba(68, 1, 84, 0.85)", + "rgba(70, 12, 95, 0.85)", + "rgba(72, 25, 107, 0.85)", + "rgba(71, 37, 117, 0.85)", + "rgba(70, 48, 125, 0.85)", + "rgba(67, 59, 131, 0.85)", + "rgba(64, 68, 135, 0.85)", + "rgba(60, 78, 138, 0.85)", + "rgba(55, 88, 140, 0.85)", + "rgba(51, 97, 141, 0.85)", + "rgba(47, 106, 141, 0.85)", + "rgba(43, 115, 142, 0.85)", + "rgba(40, 122, 142, 0.85)", + "rgba(37, 131, 141, 0.85)", + "rgba(34, 139, 141, 0.85)", + "rgba(31, 148, 139, 0.85)", + "rgba(30, 156, 137, 0.85)", + "rgba(32, 165, 133, 0.85)", + "rgba(38, 172, 129, 0.85)", + "rgba(48, 180, 122, 0.85)", + "rgba(62, 188, 115, 0.85)", + "rgba(79, 195, 105, 0.85)", + "rgba(98, 202, 95, 0.85)", + "rgba(119, 208, 82, 0.85)", + "rgba(139, 213, 70, 0.85)", + "rgba(162, 218, 55, 0.85)", + "rgba(186, 222, 39, 0.85)", + "rgba(210, 225, 27, 0.85)", + "rgba(233, 228, 25, 0.85)", + "rgba(253, 231, 36, 0.85)" + ] + # Can't get more than 256 colors. + with pytest.raises( + ValueError, + match="The maximum number of colors is 256."): + get_distinct_colors(num_colors=257) + + # Opacity must be between 0 and 1 + with pytest.raises( + ValueError, + match="Opacity must be between 0 and 1."): + get_distinct_colors(num_colors=2, opacity=-1) diff --git a/greykite/tests/common/viz/test_timeseries_annotate.py b/greykite/tests/common/viz/test_timeseries_annotate.py index 2d08282..732c52b 100644 --- a/greykite/tests/common/viz/test_timeseries_annotate.py +++ b/greykite/tests/common/viz/test_timeseries_annotate.py @@ -1,6 +1,17 @@ +import numpy as np import pandas as pd import pytest +from greykite.common.constants import END_TIME_COL +from greykite.common.constants import START_TIME_COL +from greykite.common.python_utils import assert_equal +from greykite.common.viz.colors_utils import get_distinct_colors +from greykite.common.viz.timeseries_annotate import add_multi_vrects +from greykite.common.viz.timeseries_annotate import 
plot_anomalies_over_forecast_vs_actual +from greykite.common.viz.timeseries_annotate import plot_event_periods_multi +from greykite.common.viz.timeseries_annotate import plot_lines_markers +from greykite.common.viz.timeseries_annotate import plot_overlay_anomalies_multi_metric +from greykite.common.viz.timeseries_annotate import plot_precision_recall_curve from greykite.common.viz.timeseries_annotate import plt_annotate_series from greykite.common.viz.timeseries_annotate import plt_compare_series_annotations @@ -169,3 +180,553 @@ def test_fill(): forecast_col=None, fill_cols=["forecast_lower", "forecast_upper"], standardize_col="actual") + + +def test_plot_lines_markers(): + """Tests ``plot_lines_markers``.""" + df = pd.DataFrame({ + "ts": [1, 2, 3, 4, 5, 6], + "line1": [3, 4, 5, 6, 7, 7], + "line2": [4, 5, 6, 7, 8, 8], + "marker1": [np.nan, np.nan, 5, 6, np.nan, np.nan], + "marker2": [np.nan, 5, 6, np.nan, np.nan, 8], + }) + + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["line1", "line2"], + marker_cols=["marker1", "marker2"], + line_colors=None, + marker_colors=None) + + assert len(fig.data) == 4 + assert fig.data[0].line.color is None + assert fig.data[1].line.color is None + assert fig.data[2].marker.color is None + assert fig.data[3].marker.color is None + + # Next we make the marker and line colors consistent + marker_colors = get_distinct_colors( + num_colors=2, + opacity=1.0) + + line_colors = get_distinct_colors( + num_colors=2, + opacity=0.5) + + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["line1", "line2"], + marker_cols=["marker1", "marker2"], + line_colors=line_colors, + marker_colors=marker_colors) + + assert len(fig.data) == 4 + assert fig.data[0].line.color == "rgba(31, 119, 180, 0.5)" + assert fig.data[1].line.color == "rgba(255, 127, 14, 0.5)" + assert fig.data[2].marker.color == "rgba(31, 119, 180, 1.0)" + assert fig.data[3].marker.color == "rgba(255, 127, 14, 1.0)" + + # Length of ``line_cols`` must be the same 
as ``line_colors`` if passed. + with pytest.raises( + ValueError, + match="If `line_colors` is passed, its length must be equal to `line_cols`"): + plot_lines_markers( + df=df, + x_col="ts", + line_cols=["line1", "line2"], + marker_cols=["marker1", "marker2"], + line_colors=line_colors[:1], + marker_colors=marker_colors) + + # At least one of ``line_cols`` or ``marker_cols`` must be provided (not None). + with pytest.raises( + ValueError, + match="At least one of"): + plot_lines_markers( + df=df, + x_col="ts", + line_cols=None, + marker_cols=None, + line_colors=None, + marker_colors=None) + + +def test_plot_event_periods_multi(): + """Tests ``plot_event_periods_multi`` function.""" + df = pd.DataFrame({ + "start_time": ["2020-01-01", "2020-02-01", "2020-01-02", "2020-02-02", "2020-02-05"], + "end_time": ["2020-01-03", "2020-02-04", "2020-01-05", "2020-02-06", "2020-02-08"], + "metric": ["impressions", "impressions", "clicks", "clicks", "bookings"] + }) + + # ``grouping_col`` is not passed. + res = plot_event_periods_multi( + periods_df=df, + start_time_col="start_time", + end_time_col="end_time", + freq=None, + grouping_col=None, + min_timestamp=None, + max_timestamp=None) + + fig = res["fig"] + labels_df = res["labels_df"] + ts = res["ts"] + min_timestamp = res["min_timestamp"] + max_timestamp = res["max_timestamp"] + new_cols = res["new_cols"] + + assert labels_df.shape == (913, 2) + assert list(labels_df.columns) == ["ts", "metric_is_anomaly"] + assert len(fig.data) == 1 + assert len(fig.data[0]["x"]) == 913 + assert len(ts) == 913 + assert min_timestamp == "2020-01-01" + assert max_timestamp == "2020-02-08" + assert new_cols == ["metric_is_anomaly"] + + # ``grouping_col`` is passed. 
+ res = plot_event_periods_multi( + periods_df=df, + start_time_col="start_time", + end_time_col="end_time", + freq=None, + grouping_col="metric", + min_timestamp=None, + max_timestamp=None) + + fig = res["fig"] + labels_df = res["labels_df"] + ts = res["ts"] + min_timestamp = res["min_timestamp"] + max_timestamp = res["max_timestamp"] + new_cols = res["new_cols"] + + assert labels_df.shape == (913, 4) + assert set(labels_df.columns) == { + "ts", "bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + assert len(fig.data) == 3 + assert len(fig.data[0]["x"]) == 913 + assert len(ts) == 913 + assert min_timestamp == "2020-01-01" + assert max_timestamp == "2020-02-08" + assert set(new_cols) == {"bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + + # Specifies ``freq`` + res = plot_event_periods_multi( + periods_df=df, + start_time_col="start_time", + end_time_col="end_time", + freq="min", + grouping_col="metric", + min_timestamp=None, + max_timestamp=None) + + fig = res["fig"] + labels_df = res["labels_df"] + ts = res["ts"] + min_timestamp = res["min_timestamp"] + max_timestamp = res["max_timestamp"] + new_cols = res["new_cols"] + + assert labels_df.shape == (54721, 4) + assert set(labels_df.columns) == { + "ts", "bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + assert len(fig.data) == 3 + assert len(fig.data[0]["x"]) == 54721 + assert len(ts) == 54721 + assert min_timestamp == "2020-01-01" + assert max_timestamp == "2020-02-08" + assert set(new_cols) == {"bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + + # ``min_timestamp``, ``max_timestamp`` specified. 
+ res = plot_event_periods_multi( + periods_df=df, + start_time_col="start_time", + end_time_col="end_time", + freq=None, + grouping_col="metric", + min_timestamp="2019-12-15", + max_timestamp="2020-02-15") + + fig = res["fig"] + labels_df = res["labels_df"] + ts = res["ts"] + min_timestamp = res["min_timestamp"] + max_timestamp = res["max_timestamp"] + new_cols = res["new_cols"] + + assert labels_df.shape == (1489, 4) + assert set(labels_df.columns) == { + "ts", "bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + assert len(fig.data) == 3 + assert len(fig.data[0]["x"]) == 1489 + assert len(ts) == 1489 + assert min_timestamp == "2019-12-15" + assert max_timestamp == "2020-02-15" + assert set(new_cols) == {"bookings_is_anomaly", "impressions_is_anomaly", "clicks_is_anomaly"} + + # Tests for raising ``ValueError`` due to start time being larger than end time. + df = pd.DataFrame({ + "start_time": ["2020-01-03"], + "end_time": ["2020-01-02"], + "metric": ["impressions"] + }) + + expected_match = "End Time:" + with pytest.raises(ValueError, match=expected_match): + plot_event_periods_multi( + periods_df=df, + start_time_col="start_time", + end_time_col="end_time", + freq=None, + grouping_col=None, + min_timestamp=None, + max_timestamp=None) + + +def test_add_multi_vrects(): + """Tests ``add_multi_vrects``.""" + periods_df = pd.DataFrame({ + "start_time": ["2019-12-28", "2020-01-25", "2020-01-02", "2020-02-02", "2020-02-05"], + "end_time": ["2020-01-03", "2020-02-04", "2020-01-05", "2020-02-06", "2020-02-08"], + "metric": ["impressions", "impressions", "clicks", "clicks", "bookings"], + "reason": ["C1", "C2", "GCN1", "GCN2", "Lockdown"] + }) + + ts = pd.date_range(start="2019-12-15", end="2020-02-15", freq="D") + df = pd.DataFrame({ + "ts": ts, + "y1": range(len(ts))}) + + df["y2"] = (df["y1"] - 10)**(1.5) + + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["y1", "y2"], + marker_cols=None, + line_colors=None, + marker_colors=None) + + # 
With multiple groups + res = add_multi_vrects( + periods_df=periods_df, + fig=fig, + start_time_col="start_time", + end_time_col="end_time", + grouping_col="metric", + y_min=-15, + y_max=df["y1"].max(), + annotation_text_col="reason", + grouping_color_dict=None) + + fig = res["fig"] + grouping_color_dict = res["grouping_color_dict"] + assert grouping_color_dict == { + "bookings": "rgba(31, 119, 180, 1.0)", + "clicks": "rgba(255, 127, 14, 1.0)", + "impressions": "rgba(44, 160, 44, 1.0)"} + assert len(fig.data) == 2 + assert len(fig.layout.shapes) == 5 + + # No groups + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["y1", "y2"], + marker_cols=None, + line_colors=None, + marker_colors=None) + + res = add_multi_vrects( + periods_df=periods_df, + fig=fig, + start_time_col="start_time", + end_time_col="end_time", + grouping_col=None, + y_min=-15, + y_max=df["y1"].max(), + annotation_text_col="reason", + opacity=0.4, + grouping_color_dict=None) + + fig = res["fig"] + grouping_color_dict = res["grouping_color_dict"] + assert grouping_color_dict == {"metric": "rgba(31, 119, 180, 1.0)"} + assert len(fig.data) == 2 + assert len(fig.layout.shapes) == 5 + + # No groups, no text annotations + fig = plot_lines_markers( + df=df, + x_col="ts", + line_cols=["y1", "y2"], + marker_cols=None, + line_colors=None, + marker_colors=None) + + res = add_multi_vrects( + periods_df=periods_df, + fig=fig, + start_time_col="start_time", + end_time_col="end_time", + grouping_col=None, + y_min=-15, + y_max=df["y1"].max(), + annotation_text_col=None, + opacity=0.4, + grouping_color_dict=None) + + fig = res["fig"] + grouping_color_dict = res["grouping_color_dict"] + assert grouping_color_dict == {"metric": "rgba(31, 119, 180, 1.0)"} + assert len(fig.data) == 2 + assert len(fig.layout.shapes) == 5 + + # Tests for raising ``ValueError`` due to non-existing column name + expected_match = "start_time_col" + with pytest.raises(ValueError, match=expected_match): + add_multi_vrects( + 
periods_df=periods_df, + fig=fig, + start_time_col="start_timestamp", # This column does not exist + end_time_col="end_time", + grouping_col=None, + y_min=-15, + y_max=df["y1"].max(), + annotation_text_col=None, + opacity=0.4, + grouping_color_dict=None) + + +def test_plot_overlay_anomalies_multi_metric(): + """Tests ``plot_overlay_anomalies_multi_metric``.""" + anomaly_df = pd.DataFrame({ + "start_time": ["2020-01-01", "2020-02-01", "2020-01-02", "2020-02-02", "2020-02-05"], + "end_time": ["2020-01-03", "2020-02-04", "2020-01-05", "2020-02-06", "2020-02-08"], + "metric": ["impressions", "impressions", "clicks", "clicks", "bookings"], + "reason": ["C1", "C2", "GCN1", "GCN2", "Lockdown"] + }) + + ts = pd.date_range(start="2019-12-01", end="2020-03-01", freq="D") + + np.random.seed(1317) + df = pd.DataFrame({"ts": ts}) + size = len(df) + value_cols = ["impressions", "clicks", "bookings"] + df["impressions"] = np.random.normal(loc=0.0, scale=1.0, size=size) + df["clicks"] = np.random.normal(loc=1.0, scale=1.0, size=size) + df["bookings"] = np.random.normal(loc=2.0, scale=1.0, size=size) + + # Without annotation texts + res = plot_overlay_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=value_cols, + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL) + + fig = res["fig"] + augmented_df = res["augmented_df"] + is_anomaly_cols = res["is_anomaly_cols"] + line_colors = res["line_colors"] + + assert len(fig.data) == 6 + assert augmented_df.shape == (92, 13) + + assert set(is_anomaly_cols) == { + "impressions_is_anomaly", + "clicks_is_anomaly", + "bookings_is_anomaly"} + + assert line_colors == [ + "rgba(31, 119, 180, 0.6)", + "rgba(255, 127, 14, 0.6)", + "rgba(44, 160, 44, 0.6)"] + + # With annotation texts + ts = pd.date_range(start="2019-12-25", end="2020-02-15", freq="D") + + np.random.seed(1317) + df = pd.DataFrame({"ts": ts}) + size = len(df) + value_cols = ["impressions", "clicks", 
"bookings"] + df["impressions"] = np.random.normal(loc=0.0, scale=1.0, size=size) + df["clicks"] = np.random.normal(loc=1.0, scale=1.0, size=size) + df["bookings"] = np.random.normal(loc=2.0, scale=1.0, size=size) + + res = plot_overlay_anomalies_multi_metric( + df=df, + time_col="ts", + value_cols=value_cols, + anomaly_df=anomaly_df, + anomaly_df_grouping_col="metric", + start_time_col=START_TIME_COL, + end_time_col=END_TIME_COL, + annotation_text_col="reason") + + fig = res["fig"] + augmented_df = res["augmented_df"] + is_anomaly_cols = res["is_anomaly_cols"] + line_colors = res["line_colors"] + + assert len(fig.data) == 6 + assert augmented_df.shape == (53, 13) + + assert set(is_anomaly_cols) == { + "impressions_is_anomaly", + "clicks_is_anomaly", + "bookings_is_anomaly"} + + assert line_colors == [ + "rgba(31, 119, 180, 0.6)", + "rgba(255, 127, 14, 0.6)", + "rgba(44, 160, 44, 0.6)"] + + +def test_plot_precision_recall_curve(): + """Tests ``plot_precision_recall_curve``.""" + # Creates fake data. + precision = np.linspace(0, 1, num=25) + recall = np.linspace(0, 0.5, num=25) + pr_df = pd.DataFrame({ + "precision": precision, + "recall": recall, + "key": 1}) + groups_df = pd.DataFrame({"groups": ["A", "B", "C"], "key": 1}) + df = pr_df.merge(groups_df, how="inner", on="key").reset_index(drop=True).drop("key", 1) + # Generates the precision recall curve. + fig = plot_precision_recall_curve( + df=df, + grouping_col="groups", + precision_col="precision", + recall_col="recall") + # Asserts one curve per group. + assert len(fig.data) == groups_df.shape[0] + # Checks that the data is correct. + for index, row in groups_df.iterrows(): + assert_equal(np.array(fig.data[index]["x"]), recall) + assert_equal(np.array(fig.data[index]["y"]), precision) + assert_equal(np.array(fig.data[index]["name"]), row["groups"]) + + # Generates the precision recall curve when `grouping_col` is None. 
+ fig = plot_precision_recall_curve( + df=pr_df, + grouping_col=None, + precision_col="precision", + recall_col="recall") + # Asserts only one curve. + assert len(fig.data) == 1 + # Checks that the data is correct. + assert_equal(np.array(fig.data[0]["x"]), recall) + assert_equal(np.array(fig.data[0]["y"]), precision) + + # Tests expected errors. + expected_match = "must contain" + with pytest.raises(ValueError, match=expected_match): + plot_precision_recall_curve( + df=df, + grouping_col=None, + precision_col="wrong_column", + recall_col="recall") + + expected_match = "must contain" + with pytest.raises(ValueError, match=expected_match): + plot_precision_recall_curve( + df=df, + grouping_col=None, + precision_col="precision", + recall_col="wrong_column") + + expected_match = "is not found" + with pytest.raises(ValueError, match=expected_match): + plot_precision_recall_curve( + df=df, + grouping_col="wrong_column", + precision_col="precision", + recall_col="recall") + + +def test_plot_anomalies_over_forecast_vs_actual(): + """Tests ``plot_anomalies_over_forecast_vs_actual`` function.""" + size = 200 + num_anomalies = 10 + num_predicted_anomalies = 15 + df = pd.DataFrame({ + "ts": pd.date_range(start="2018-01-01", periods=size, freq="H"), + "actual": np.random.normal(scale=10, size=size) + }) + df["forecast"] = df["actual"] + np.random.normal(size=size) + df["forecast_lower"] = df["forecast"] - 10 + df["forecast_upper"] = df["forecast"] + 10 + df["is_anomaly"] = False + df["is_anomaly_predicted"] = False + df.loc[df.sample(num_anomalies).index, "is_anomaly"] = True + df.loc[df.sample(num_predicted_anomalies).index, "is_anomaly_predicted"] = True + + # Tests plots when both `predicted_anomaly_col` and `anomaly_col` are not None. 
+ fig = plot_anomalies_over_forecast_vs_actual( + df=df, + time_col="ts", + actual_col="actual", + predicted_col="forecast", + predicted_lower_col="forecast_lower", + predicted_upper_col="forecast_upper", + predicted_anomaly_col="is_anomaly_predicted", + anomaly_col="is_anomaly", + predicted_anomaly_marker_color="lightblue", + anomaly_marker_color="orange", + marker_opacity=0.4) + assert len(fig.data) == 6 + # Checks the predicted anomaly data is correct. + fig_predicted_anomaly_data = [data for data in fig.data if data["name"] == "is_anomaly_predicted".title()][0] + assert len(fig_predicted_anomaly_data["x"]) == num_predicted_anomalies + assert fig_predicted_anomaly_data["marker"]["color"] == "lightblue" + assert fig_predicted_anomaly_data["opacity"] == 0.4 + # Checks the anomaly data is correct. + fig_anomaly_data = [data for data in fig.data if data["name"] == "is_anomaly".title()][0] + assert len(fig_anomaly_data["x"]) == num_anomalies + assert fig_anomaly_data["marker"]["color"] == "orange" + assert fig_anomaly_data["opacity"] == 0.4 + + # Tests plots when `predicted_anomaly_col` is None. + fig = plot_anomalies_over_forecast_vs_actual( + df=df, + time_col="ts", + actual_col="actual", + predicted_col="forecast", + predicted_lower_col="forecast_lower", + predicted_upper_col="forecast_upper", + predicted_anomaly_col=None, + anomaly_col="is_anomaly") + assert len(fig.data) == 5 + + # Tests plots when `anomaly_col` is None. + fig = plot_anomalies_over_forecast_vs_actual( + df=df, + time_col="ts", + actual_col="actual", + predicted_col="forecast", + predicted_lower_col="forecast_lower", + predicted_upper_col="forecast_upper", + predicted_anomaly_col="is_anomaly_predicted", + anomaly_col=None) + assert len(fig.data) == 5 + + # Tests plots when both `predicted_anomaly_col` and `anomaly_col` are None. 
+ fig = plot_anomalies_over_forecast_vs_actual( + df=df, + time_col="ts", + actual_col="actual", + predicted_col="forecast", + predicted_lower_col="forecast_lower", + predicted_upper_col="forecast_upper", + predicted_anomaly_col=None, + anomaly_col=None) + assert len(fig.data) == 4 diff --git a/greykite/tests/common/viz/test_timeseries_plotting.py b/greykite/tests/common/viz/test_timeseries_plotting.py index 61ecca2..c0be896 100644 --- a/greykite/tests/common/viz/test_timeseries_plotting.py +++ b/greykite/tests/common/viz/test_timeseries_plotting.py @@ -5,7 +5,7 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal +from pandas.testing import assert_frame_equal from plotly.colors import DEFAULT_PLOTLY_COLORS from testfixtures import LogCapture @@ -14,6 +14,7 @@ from greykite.common.viz.timeseries_plotting import add_groupby_column from greykite.common.viz.timeseries_plotting import flexible_grouping_evaluation from greykite.common.viz.timeseries_plotting import grouping_evaluation +from greykite.common.viz.timeseries_plotting import plot_dual_axis_figure from greykite.common.viz.timeseries_plotting import plot_forecast_vs_actual from greykite.common.viz.timeseries_plotting import plot_multivariate from greykite.common.viz.timeseries_plotting import plot_multivariate_grouped @@ -723,6 +724,7 @@ def test_flexible_grouping_evaluation(): groupby_col="groups", agg_kwargs={"func": agg_list} ) + eval_df = eval_df[[col for col in eval_df.columns if "ts" not in col]] agg_dict = { # equivalent dictionary specification "actual": [np.nanmedian, np.nanmean], "forecast": [np.nanmedian, np.nanmean], @@ -825,3 +827,91 @@ def test_flexible_grouping_evaluation(): agg_kwargs={"func": "mean"}, extend_col_names=True) assert list(eval_df.columns) == ["actual", "forecast"] + + +def test_plot_dual_axis_figure(): + """Tests `plot_dual_axis_figure` function""" + + # Generates fake data. 
+ num_points = 50 + x_col = "x" + y_left_col = "y_left" + y_right_col = "y_right" + grouping_col = "groups" + df = pd.DataFrame({ + x_col: np.linspace(0, 1, num=num_points), + y_left_col: np.linspace(1, 100, num=num_points), + y_right_col: np.random.rand(num_points), + grouping_col: ["A"]*(num_points // 2) + ["B"]*(num_points // 2)}) + # Plot params. + xlabel = "apples" + ylabel_left = "oranges" + ylabel_right = "banana" + title = "ys vs x" + + # Tests when `grouping_col` is specified. + fig = plot_dual_axis_figure( + df=df, + x_col=x_col, + y_left_col=y_left_col, + y_right_col=y_right_col, + xlabel=xlabel, + ylabel_left=ylabel_left, + ylabel_right=ylabel_right, + title=title, + grouping_col=grouping_col, + y_left_linestyle="dash", + y_right_linestyle="dot") + assert len(fig.data) == 4 + assert fig.layout.xaxis.title.text == xlabel + assert fig.layout.yaxis.title.text == ylabel_left + assert fig.layout.yaxis2.title.text == ylabel_right + assert fig.data[0].line.dash == "dash" + assert fig.data[1].line.dash == "dot" + assert fig.data[2].line.dash == "dash" + assert fig.data[3].line.dash == "dot" + + # Tests when `group_color_dict` is specified. + group_color_dict = {"A": "blue", "B": "green"} + fig = plot_dual_axis_figure( + df=df, + x_col=x_col, + y_left_col=y_left_col, + y_right_col=y_right_col, + grouping_col=grouping_col, + group_color_dict=group_color_dict) + for level, level_color in group_color_dict.items(): + assert all([curve.line.color == level_color for curve in fig.data if curve.legendgroup == f"{grouping_col} = {level}"]) + + # Tests when `grouping_col` is None. 
+ fig = plot_dual_axis_figure( + df=df, + x_col=x_col, + y_left_col=y_left_col, + y_right_col=y_right_col, + xlabel=xlabel, + ylabel_left=ylabel_left, + ylabel_right=ylabel_right, + title=title, + grouping_col=None, + y_left_linestyle="longdashdot", + y_right_linestyle="solid") + assert len(fig.data) == 2 + assert fig.layout.xaxis.title.text == xlabel + assert fig.layout.yaxis.title.text == ylabel_left + assert fig.layout.yaxis2.title.text == ylabel_right + assert fig.data[0].line.dash == "longdashdot" + assert fig.data[1].line.dash == "solid" + + # Test for correct data input. + with pytest.raises(ValueError, match=r"must contain the columns"): + plot_dual_axis_figure( + df=df, + x_col="x_col_wrong", + y_left_col=y_left_col, + y_right_col=y_right_col, + xlabel=xlabel, + ylabel_left=ylabel_left, + ylabel_right=ylabel_right, + title=title, + grouping_col=grouping_col) diff --git a/greykite/tests/framework/input/test_univariate_time_series.py b/greykite/tests/framework/input/test_univariate_time_series.py index 6265c25..51ebd5e 100644 --- a/greykite/tests/framework/input/test_univariate_time_series.py +++ b/greykite/tests/framework/input/test_univariate_time_series.py @@ -197,9 +197,10 @@ def test_make_future_dataframe(): }) with pytest.warns(UserWarning) as record: ts.load_data(df, TIME_COL, VALUE_COL, regressor_cols=None) - assert f"{ts.original_value_col} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] + assert f"{ts.original_value_col} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({ts.train_end_date})." 
in record[0].message.args[0] # test regressor_cols from load_data assert ts.regressor_cols == [] @@ -272,7 +273,7 @@ def test_make_future_dataframe_with_regressor(): with pytest.warns(Warning) as record: ts.load_data(df, TIME_COL, VALUE_COL, regressor_cols=regressor_cols) - assert "y column of the provided TimeSeries contains null " \ + assert "y column of the provided time series contains null " \ "values at the end" in record[0].message.args[0] # test regressor_cols from load_data @@ -373,163 +374,164 @@ def test_train_end_date_without_regressors(): # train_end_date later than last date in df with pytest.warns(UserWarning) as record: - train_end_date = dt(2018, 1, 1, 8, 0, 0) - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] - assert ts.train_end_date == dt(2018, 1, 1, 5, 0, 0) - result = ts.make_future_dataframe( - periods=10, - include_history=True) - expected = pd.DataFrame({ - TIME_COL: pd.date_range( - start=dt(2018, 1, 1, 3, 0, 0), - periods=13, - freq="H"), - VALUE_COL: np.concatenate((ts.fit_y, np.repeat(np.nan, 10))) - }) - expected.index = expected[TIME_COL] - expected.index.name = None - assert_frame_equal(result, expected) + for train_end_date in ["2018/1/1/7"]: + date_format = "%Y/%m/%d/%H" + ts.load_data( + df, + TIME_COL, + VALUE_COL, + train_end_date=train_end_date, + date_format=date_format + ) + assert f"{VALUE_COL} column of the provided time series contains trailing NAs." 
in record[0].message.args[0] + assert ts.train_end_date == dt(2018, 1, 1, 7, 0, 0) + result = ts.make_future_dataframe( + periods=10, + include_history=True) + expected = pd.DataFrame({ + TIME_COL: pd.date_range( + start=dt(2018, 1, 1, 3, 0, 0), + periods=15, + freq="H"), + VALUE_COL: np.concatenate((ts.fit_y, np.repeat(np.nan, 10))) + }) + expected.index = expected[TIME_COL] + expected.index.name = None + assert_frame_equal(result, expected) def test_train_end_date_with_regressors(): """Tests make_future_dataframe and train_end_date with regressors""" - data = generate_df_with_reg_for_tests( - freq="D", - periods=30, - train_start_date=datetime.datetime(2018, 1, 1), - remove_extra_cols=True, - mask_test_actuals=True) - regressor_cols = ["regressor1", "regressor2", "regressor_categ"] - keep_cols = [TIME_COL, VALUE_COL] + regressor_cols - df = data["df"][keep_cols].copy() - # Setting NaN values at the end - df.loc[df.tail(2).index, "regressor1"] = np.nan - df.loc[df.tail(4).index, "regressor2"] = np.nan - df.loc[df.tail(6).index, "regressor_categ"] = np.nan - df.loc[df.tail(8).index, VALUE_COL] = np.nan - - # default train_end_date, default regressor_cols - with pytest.warns(UserWarning) as record: - ts = UnivariateTimeSeries() - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=None, regressor_cols=None) - assert f"{ts.original_value_col} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." 
in record[0].message.args[0] - assert ts.train_end_date == dt(2018, 1, 22) - assert ts.fit_df.shape == (22, 2) - assert ts.last_date_for_val == df[df[VALUE_COL].notnull()][TIME_COL].max() - assert ts.last_date_for_reg is None - result = ts.make_future_dataframe( - periods=10, - include_history=True) - expected = pd.DataFrame({ - TIME_COL: pd.date_range( - start=dt(2018, 1, 1), - periods=32, - freq="D"), - VALUE_COL: np.concatenate([ts.fit_y, np.repeat(np.nan, 10)]) - }) - expected.index = expected[TIME_COL] - expected.index.name = None - assert_frame_equal(result, expected) - - # train_end_date later than last date in df, all available regressor_cols - with pytest.warns(UserWarning) as record: - ts = UnivariateTimeSeries() - train_end_date = dt(2018, 2, 10) - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." 
in record[0].message.args[0] - assert ts.last_date_for_val == dt(2018, 1, 22) - assert ts.last_date_for_reg == dt(2018, 1, 28) - result = ts.make_future_dataframe( - periods=10, - include_history=False) - expected = df.copy()[22:28] - expected.loc[expected.tail(6).index, VALUE_COL] = np.nan - expected.index = expected[TIME_COL] - expected.index.name = None - assert_frame_equal(result, expected) - - # train_end_date in between last date in df and last date before null - # user passes no regressor_cols - with pytest.warns(UserWarning) as record: - ts = UnivariateTimeSeries() - train_end_date = dt(2018, 1, 25) - regressor_cols = [] - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." 
in record[0].message.args[0] - assert ts.train_end_date == dt(2018, 1, 22) - assert ts.last_date_for_reg is None - result = ts.make_future_dataframe( - periods=10, - include_history=True) - expected = pd.DataFrame({ - TIME_COL: pd.date_range( - start=dt(2018, 1, 1), - periods=32, - freq="D"), - VALUE_COL: np.concatenate([ts.fit_y, np.repeat(np.nan, 10)]) - }) - expected.index = expected[TIME_COL] - expected.index.name = None - assert_frame_equal(result, expected) - - # train end date equal to last date before null - # user requests a subset of the regressor_cols - with pytest.warns(UserWarning) as record: - ts = UnivariateTimeSeries() - train_end_date = dt(2018, 1, 22) - regressor_cols = ["regressor2"] - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) - assert ts.train_end_date == dt(2018, 1, 22) - assert ts.last_date_for_reg == dt(2018, 1, 26) - result = ts.make_future_dataframe( - periods=10, - include_history=True) - # gathers all warning messages - all_warnings = "" - for i in range(len(record)): - all_warnings += record[i].message.args[0] - assert "Provided periods '10' is more than allowed ('4') due to the length of " \ - "regressor columns. Using '4'." 
in all_warnings - expected = ts.df.copy()[[TIME_COL, VALUE_COL, "regressor2"]] - expected = expected[expected.index <= ts.last_date_for_reg] - assert_frame_equal(result, expected) - - # train_end_date smaller than last date before null - # user requests regressor_cols that does not exist in df - with pytest.warns(UserWarning) as record: - ts = UnivariateTimeSeries() - train_end_date = dt(2018, 1, 20) - regressor_cols = ["regressor1", "regressor4", "regressor5"] - ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) - assert ts.train_end_date == dt(2018, 1, 20) - assert ts.last_date_for_reg == dt(2018, 1, 28) - # gathers all warning messages - all_warnings = "" - for i in range(len(record)): - all_warnings += record[i].message.args[0] - assert (f"The following columns are not available to use as " - f"regressors: ['regressor4', 'regressor5']") in all_warnings - result = ts.make_future_dataframe( - periods=10, - include_history=True) - expected = ts.df.copy()[[TIME_COL, VALUE_COL, "regressor1"]] - expected = expected[expected.index <= ts.last_date_for_reg] - assert_frame_equal(result, expected) + for train_start_date in ["2018-01-01", datetime.datetime(2018, 1, 1)]: + data = generate_df_with_reg_for_tests( + freq="D", + periods=30, + train_start_date=train_start_date, + remove_extra_cols=True, + mask_test_actuals=True) + regressor_cols = ["regressor1", "regressor2", "regressor_categ"] + keep_cols = [TIME_COL, VALUE_COL] + regressor_cols + df = data["df"][keep_cols].copy() + # Setting NaN values at the end + df.loc[df.tail(2).index, "regressor1"] = np.nan + df.loc[df.tail(4).index, "regressor2"] = np.nan + df.loc[df.tail(6).index, "regressor_categ"] = np.nan + df.loc[df.tail(8).index, VALUE_COL] = np.nan + + # default train_end_date, default regressor_cols + with pytest.warns(UserWarning) as record: + ts = UnivariateTimeSeries() + ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=None, regressor_cols=None) + assert 
f"{ts.original_value_col} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({ts.train_end_date})." in record[0].message.args[0] + assert ts.train_end_date == dt(2018, 1, 22) + assert ts.fit_df.shape == (22, 2) + assert ts.last_date_for_val == df[df[VALUE_COL].notnull()][TIME_COL].max() + assert ts.last_date_for_reg is None + result = ts.make_future_dataframe( + periods=10, + include_history=True) + expected = pd.DataFrame({ + TIME_COL: pd.date_range( + start=dt(2018, 1, 1), + periods=32, + freq="D"), + VALUE_COL: np.concatenate([ts.fit_y, np.repeat(np.nan, 10)]) + }) + expected.index = expected[TIME_COL] + expected.index.name = None + assert_frame_equal(result, expected) + + # train_end_date later than last date in df, all available regressor_cols + with pytest.warns(UserWarning) as record: + ts = UnivariateTimeSeries() + train_end_date = dt(2018, 2, 10) + ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) + assert f"{VALUE_COL} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({dt(2018, 1, 22)})." 
in record[0].message.args[0] + assert ts.last_date_for_val == dt(2018, 1, 22) + assert ts.last_date_for_reg == dt(2018, 1, 28) + result = ts.make_future_dataframe( + periods=10, + include_history=False) + expected = df.copy()[22:28] + expected.loc[expected.tail(6).index, VALUE_COL] = np.nan + expected.index = expected[TIME_COL] + expected.index.name = None + assert_frame_equal(result, expected) + + # train_end_date in between last date in df and last date before null + # user passes no regressor_cols + with pytest.warns(UserWarning) as record: + ts = UnivariateTimeSeries() + train_end_date = dt(2018, 1, 25) + regressor_cols = [] + ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) + assert f"{VALUE_COL} column of the provided time series contains trailing NAs." in record[0].message.args[0] + assert ts.train_end_date == dt(2018, 1, 25) + assert ts.last_date_for_reg is None + result = ts.make_future_dataframe( + periods=10, + include_history=True) + assert len(ts.fit_y) == 25 # Since `train_end_date` is 2018-01-25. 
+ expected = pd.DataFrame({ + TIME_COL: pd.date_range( + start=dt(2018, 1, 1), + periods=35, + freq="D"), + VALUE_COL: np.concatenate([ts.fit_y, np.repeat(np.nan, 10)]) + }) + expected.index = expected[TIME_COL] + expected.index.name = None + assert_frame_equal(result, expected) + + # train end date equal to last date before null + # user requests a subset of the regressor_cols + with pytest.warns(UserWarning) as record: + ts = UnivariateTimeSeries() + train_end_date = dt(2018, 1, 22) + regressor_cols = ["regressor2"] + ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) + assert ts.train_end_date == dt(2018, 1, 22) + assert ts.last_date_for_reg == dt(2018, 1, 26) + result = ts.make_future_dataframe( + periods=10, + include_history=True) + # gathers all warning messages + all_warnings = "" + for i in range(len(record)): + all_warnings += record[i].message.args[0] + assert "Provided periods '10' is more than allowed ('4') due to the length of " \ + "regressor columns. Using '4'." 
in all_warnings + expected = ts.df.copy()[[TIME_COL, VALUE_COL, "regressor2"]] + expected = expected[expected.index <= ts.last_date_for_reg] + assert_frame_equal(result, expected) + + # train_end_date smaller than last date before null + # user requests regressor_cols that does not exist in df + with pytest.warns(UserWarning) as record: + ts = UnivariateTimeSeries() + train_end_date = dt(2018, 1, 20) + regressor_cols = ["regressor1", "regressor4", "regressor5"] + ts.load_data(df, TIME_COL, VALUE_COL, train_end_date=train_end_date, regressor_cols=regressor_cols) + assert ts.train_end_date == dt(2018, 1, 20) + assert ts.last_date_for_reg == dt(2018, 1, 28) + # gathers all warning messages + all_warnings = "" + for i in range(len(record)): + all_warnings += record[i].message.args[0] + assert (f"The following columns are not available to use as " + f"regressors: ['regressor4', 'regressor5']") in all_warnings + result = ts.make_future_dataframe( + periods=10, + include_history=True) + expected = ts.df.copy()[[TIME_COL, VALUE_COL, "regressor1"]] + expected = expected[expected.index <= ts.last_date_for_reg] + assert_frame_equal(result, expected) def test_plot(): diff --git a/greykite/tests/framework/output/test_univariate_forecast.py b/greykite/tests/framework/output/test_univariate_forecast.py index 50d803d..baa5c7c 100644 --- a/greykite/tests/framework/output/test_univariate_forecast.py +++ b/greykite/tests/framework/output/test_univariate_forecast.py @@ -6,8 +6,8 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal -from pandas.util.testing import assert_series_equal +from pandas.testing import assert_frame_equal +from pandas.testing import assert_series_equal from sklearn.pipeline import Pipeline from greykite.common import constants as cst @@ -20,6 +20,7 @@ from greykite.framework.pipeline.utils import get_forecast from greykite.sklearn.estimator.prophet_estimator import ProphetEstimator from 
greykite.sklearn.estimator.silverkite_estimator import SilverkiteEstimator +from greykite.sklearn.estimator.testing_utils import params_component_breakdowns try: @@ -61,6 +62,11 @@ def df2(): }) +@pytest.fixture +def expected_component_names(): + return params_component_breakdowns()["expected_component_names"] + + def test_univariate_forecast(df): """Checks univariate forecast class""" # Without test_start_date @@ -733,8 +739,8 @@ def test_make_univariate_time_series(df): assert forecast.make_univariate_time_series().df.equals(ts.df) -def test_plot_components(): - """Test plot_components of UnivariateForecast class""" +def test_plot_components(expected_component_names): + """Tests "plot_components" of UnivariateForecast class""" X = pd.DataFrame({ cst.TIME_COL: pd.date_range("2018-01-01", periods=10, freq="D"), cst.VALUE_COL: np.arange(1, 11) @@ -745,31 +751,19 @@ def test_plot_components(): trained_model = Pipeline([("estimator", SilverkiteEstimator(coverage=coverage))]) with pytest.warns(Warning) as record: trained_model.fit(X, X[cst.VALUE_COL]) - assert "No slice had sufficient sample size" in record[0].message.args[0] + assert "Zero degrees of freedom" in record[0].message.args[0] + assert "No slice had sufficient sample size" in record[1].message.args[0] forecast = get_forecast(X, trained_model) - with pytest.warns(Warning) as record: - title = "Custom component plot" - fig = forecast.plot_components(names=["trend", "YEARLY_SEASONALITY", "DUMMY"], title=title) - - expected_rows = 3 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "trend", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == 
"yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." in record[0].message.args[0] + # Tests plot_components + title = "Custom component plot" + fig = forecast.plot_components(title=title) + expected_rows = 9 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == title + assert fig.layout.title["x"] == 0.5 @pytest.mark.skipif("prophet" not in sys.modules, diff --git a/greykite/tests/framework/pipeline/test_pipeline.py b/greykite/tests/framework/pipeline/test_pipeline.py index 07db0e1..8186b03 100644 --- a/greykite/tests/framework/pipeline/test_pipeline.py +++ b/greykite/tests/framework/pipeline/test_pipeline.py @@ -359,8 +359,9 @@ def test_train_end_date_gap(): pipeline=pipeline, forecast_horizon=10) ts = result.timeseries - assert f"{ts.original_value_col} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ + assert f"{ts.original_value_col} column of the provided time series contains null " \ + f"values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a " \ f"non-null value ({ts.train_end_date})." 
in record[0].message.args[0] assert ts.train_end_date == datetime.datetime(2018, 1, 25) assert result.forecast.test_evaluation is None @@ -374,11 +375,10 @@ def test_train_end_date_gap(): pipeline=pipeline, forecast_horizon=5) ts = result.timeseries - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({datetime.datetime(2018, 1, 25)})." in record[0].message.args[0] assert ts.train_end_date == datetime.datetime(2018, 1, 25) assert result.forecast.test_evaluation is None @@ -450,9 +450,10 @@ def test_train_end_date_gap_regressors(): pipeline=get_dummy_pipeline(include_preprocessing=True), forecast_horizon=10) ts = result.timeseries - assert f"{ts.original_value_col} column of the provided TimeSeries contains " \ - f"null values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] + assert f"{ts.original_value_col} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({ts.train_end_date})." 
in record[0].message.args[0] assert ts.train_end_date == datetime.datetime(2018, 2, 21) assert ts.last_date_for_reg is None assert result.forecast.test_evaluation is None @@ -469,11 +470,10 @@ def test_train_end_date_gap_regressors(): regressor_cols=regressor_cols), forecast_horizon=5) ts = result.timeseries - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] + assert f"{VALUE_COL} column of the provided time series contains " \ + f"null values at the end, or the input `train_end_date` is beyond the last timestamp available. " \ + f"Setting `train_end_date` to the last timestamp with a non-null value " \ + f"({datetime.datetime(2018, 2, 21)})." in record[0].message.args[0] assert ts.train_end_date == datetime.datetime(2018, 2, 21) assert ts.last_date_for_reg == datetime.datetime(2018, 3, 1) forecast = result.forecast @@ -493,15 +493,11 @@ def test_train_end_date_gap_regressors(): regressor_cols=[]), forecast_horizon=5) ts = result.timeseries - assert f"Input timestamp for the parameter 'train_end_date' " \ - f"({train_end_date}) either exceeds the last available timestamp or" \ - f"{VALUE_COL} column of the provided TimeSeries contains null " \ - f"values at the end. Setting 'train_end_date' to the last timestamp with a " \ - f"non-null value ({ts.train_end_date})." in record[0].message.args[0] - assert ts.train_end_date == datetime.datetime(2018, 2, 21) + assert f"{VALUE_COL} column of the provided time series contains trailing NAs" in record[0].message.args[0] + assert ts.train_end_date == train_end_date # Unchanged when `train_end_date` is specified. 
assert ts.last_date_for_reg is None forecast = result.forecast - assert forecast.df[TIME_COL].max() == datetime.datetime(2018, 2, 26) + assert forecast.df[TIME_COL].max() == datetime.datetime(2018, 3, 3) assert forecast.test_evaluation is None # `train_end_date` smaller than last date before null, diff --git a/greykite/tests/framework/pipeline/test_utils.py b/greykite/tests/framework/pipeline/test_utils.py index 95accf7..cc98785 100644 --- a/greykite/tests/framework/pipeline/test_utils.py +++ b/greykite/tests/framework/pipeline/test_utils.py @@ -690,7 +690,9 @@ def test_run_hyperparameter_searcher(): "constant": sp_randint(1, 30) } ] - with pytest.warns(UserWarning) as record: + with pytest.warns( + UserWarning, + match="There is only one CV split"): # full hyperparameter_grid search, default 10 grid_search = run_dummy_grid_search( random_hyperparameter_grid, @@ -737,7 +739,6 @@ def test_run_hyperparameter_searcher(): score_func=mean_squared_error, greater_is_better=False, cv_report_metrics_names=expected_names) - assert "There is only one CV split" in record[0].message.args[0] def test_get_forecast(): diff --git a/greykite/tests/framework/templates/test_forecast_config.py b/greykite/tests/framework/templates/test_forecast_config.py index 5af29d3..0a84288 100644 --- a/greykite/tests/framework/templates/test_forecast_config.py +++ b/greykite/tests/framework/templates/test_forecast_config.py @@ -282,7 +282,7 @@ def test_forecast_config_json(): assert config.to_dict() -def assert_forecast_config_json_multiple_model_componments_parameter(config: Optional[ForecastConfig] = None): +def assert_forecast_config_json_multiple_model_components_parameter(config: Optional[ForecastConfig] = None): """Asserts the forecast config values. 
This function expects a particular config and is not generic""" config = ForecastConfigDefaults().apply_forecast_config_defaults(config) assert config.model_template == [ModelTemplateEnum.SILVERKITE.name, @@ -318,7 +318,7 @@ def assert_forecast_config_json_multiple_model_componments_parameter(config: Opt assert config.to_dict() # runs without error -def test_forecast_config_json_multiple_model_componments_parameter(): +def test_forecast_config_json_multiple_model_components_parameter(): """Tests ForecastConfig json with a list of model_template and model_components_param parameters""" json_str = """{ "model_template": ["SILVERKITE", "SILVERKITE_DAILY_90", "SILVERKITE_WEEKLY"], @@ -391,7 +391,7 @@ def test_forecast_config_json_multiple_model_componments_parameter(): }""" forecast_dict = json.loads(json_str) config = forecast_config_from_dict(forecast_dict) - assert_forecast_config_json_multiple_model_componments_parameter(config) + assert_forecast_config_json_multiple_model_components_parameter(config) # Null values inside `model_template` and `model_components_param` json_str = """{ diff --git a/greykite/tests/framework/templates/test_forecaster.py b/greykite/tests/framework/templates/test_forecaster.py index 0e7057f..ac7f892 100644 --- a/greykite/tests/framework/templates/test_forecaster.py +++ b/greykite/tests/framework/templates/test_forecaster.py @@ -515,7 +515,7 @@ def test_run_forecast_config_custom(): score_func=metric.name, greater_is_better=False) - # Note that for newer scikit-learn version, needs to add a check for ValueError, matching "model is misconfigured" + # Note that for newer scikit-learn (1.1+), we need to add a check for ValueError, matching "model is misconfigured" with pytest.raises((ValueError, KeyError)) as exception_info: model_components = ModelComponentsParam( regressors={ @@ -681,6 +681,8 @@ def test_run_forecast_config_with_single_simple_silverkite_template(): "estimator__holiday_post_num_days": [0], 
"estimator__holiday_pre_post_num_dict": [None], "estimator__daily_event_df_dict": [None], + "estimator__daily_event_neighbor_impact": [None], + "estimator__daily_event_shifted_effect": [None], "estimator__auto_growth": [False], "estimator__changepoints_dict": [None], "estimator__seasonality_changepoints_dict": [None], @@ -707,6 +709,7 @@ def test_run_forecast_config_with_single_simple_silverkite_template(): "estimator__drop_pred_cols": [None], "estimator__explicit_pred_cols": [None], "estimator__regression_weight_col": [None], + "estimator__remove_intercept": [False] }, ignore_keys={"estimator__time_properties": None} ) @@ -788,8 +791,8 @@ def test_estimator_get_coef_summary_from_forecaster(): def test_auto_model_template(): df = pd.DataFrame({ - "ts": pd.date_range("2020-01-01", freq="D", periods=100), - "y": range(100) + "ts": pd.date_range("2020-01-01", freq="D", periods=60), + "y": range(60) }) config = ForecastConfig( model_template=ModelTemplateEnum.AUTO.name, @@ -812,7 +815,7 @@ def test_auto_model_template(): assert forecaster.config.model_template == ModelTemplateEnum.SILVERKITE_DAILY_1_CONFIG_1.name # Not able to infer frequency, so the default is SILVERKITE - df = df.drop([1]) # drops the second row + df = df.drop([1, 24, 55]) # drops rows within every block size of 20 config.metadata_param.freq = None forecaster = Forecaster() assert forecaster._Forecaster__get_model_template(df, config) == ModelTemplateEnum.SILVERKITE.name diff --git a/greykite/tests/framework/templates/test_multistage_forecast_template.py b/greykite/tests/framework/templates/test_multistage_forecast_template.py index 7b626c5..77efa88 100644 --- a/greykite/tests/framework/templates/test_multistage_forecast_template.py +++ b/greykite/tests/framework/templates/test_multistage_forecast_template.py @@ -335,6 +335,8 @@ def test_get_hyperparameter_grid_extra_configs(df, forecast_config): 'holiday_post_num_days': 1, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, + 
'daily_event_neighbor_impact': None, + 'daily_event_shifted_effect': None, 'feature_sets_enabled': 'auto', 'fit_algorithm_dict': { 'fit_algorithm': 'ridge', @@ -353,6 +355,7 @@ def test_get_hyperparameter_grid_extra_configs(df, forecast_config): 'regressor_cols': [], 'lagged_regressor_dict': None, 'regression_weight_col': None, + 'remove_intercept': False, 'uncertainty_dict': None, 'origin_for_time_vars': None, 'train_test_thresh': None, @@ -375,6 +378,8 @@ def test_get_hyperparameter_grid_extra_configs(df, forecast_config): 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, + 'daily_event_neighbor_impact': None, + 'daily_event_shifted_effect': None, 'feature_sets_enabled': 'auto', 'fit_algorithm_dict': { 'fit_algorithm': 'ridge', @@ -393,6 +398,7 @@ def test_get_hyperparameter_grid_extra_configs(df, forecast_config): 'regressor_cols': [], 'lagged_regressor_dict': None, 'regression_weight_col': None, + 'remove_intercept': False, 'uncertainty_dict': None, 'origin_for_time_vars': None, 'train_test_thresh': None, @@ -449,6 +455,7 @@ def test_get_multistage_forecast_configs_override(df, forecast_config): 'drop_pred_cols': None, 'explicit_pred_cols': None, 'regression_weight_col': None, + 'remove_intercept': False, 'normalize_method': 'zero_to_one' }, events={ @@ -457,7 +464,9 @@ def test_get_multistage_forecast_configs_override(df, forecast_config): 'holiday_pre_num_days': 1, 'holiday_post_num_days': 1, 'holiday_pre_post_num_dict': None, - 'daily_event_df_dict': None + 'daily_event_df_dict': None, + 'daily_event_neighbor_impact': None, + 'daily_event_shifted_effect': None }, growth={ 'growth_term': 'linear' @@ -506,6 +515,7 @@ def test_get_multistage_forecast_configs_override(df, forecast_config): 'drop_pred_cols': None, 'explicit_pred_cols': None, 'regression_weight_col': None, + 'remove_intercept': False, 'normalize_method': 'zero_to_one' }, events={ @@ -514,7 +524,9 @@ def test_get_multistage_forecast_configs_override(df, 
forecast_config): 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, - 'daily_event_df_dict': None + 'daily_event_df_dict': None, + 'daily_event_neighbor_impact': None, + 'daily_event_shifted_effect': None }, growth={ 'growth_term': None @@ -587,6 +599,8 @@ def test_get_estimators_and_params_from_template_configs(df, forecast_config): 'estimator__holiday_post_num_days': [1], 'estimator__holiday_pre_post_num_dict': [None], 'estimator__daily_event_df_dict': [None], + 'estimator__daily_event_neighbor_impact': [None], + 'estimator__daily_event_shifted_effect': [None], 'estimator__feature_sets_enabled': ['auto'], 'estimator__fit_algorithm_dict': [{ 'fit_algorithm': 'ridge', @@ -605,6 +619,7 @@ def test_get_estimators_and_params_from_template_configs(df, forecast_config): 'estimator__regressor_cols': [[]], 'estimator__lagged_regressor_dict': [None], 'estimator__regression_weight_col': [None], + 'estimator__remove_intercept': [False], 'estimator__uncertainty_dict': [None], 'estimator__origin_for_time_vars': [None], 'estimator__train_test_thresh': [None], @@ -628,6 +643,8 @@ def test_get_estimators_and_params_from_template_configs(df, forecast_config): 'estimator__holiday_post_num_days': [0], 'estimator__holiday_pre_post_num_dict': [None], 'estimator__daily_event_df_dict': [None], + 'estimator__daily_event_neighbor_impact': [None], + 'estimator__daily_event_shifted_effect': [None], 'estimator__feature_sets_enabled': ['auto'], 'estimator__fit_algorithm_dict': [{ 'fit_algorithm': 'ridge', @@ -646,6 +663,7 @@ def test_get_estimators_and_params_from_template_configs(df, forecast_config): 'estimator__regressor_cols': [[]], 'estimator__lagged_regressor_dict': [None], 'estimator__regression_weight_col': [None], + 'estimator__remove_intercept': [False], 'estimator__uncertainty_dict': [None], 'estimator__origin_for_time_vars': [None], 'estimator__train_test_thresh': [None], diff --git 
a/greykite/tests/framework/templates/test_multistage_forecast_template_config.py b/greykite/tests/framework/templates/test_multistage_forecast_template_config.py index e612ff6..5c00cf0 100644 --- a/greykite/tests/framework/templates/test_multistage_forecast_template_config.py +++ b/greykite/tests/framework/templates/test_multistage_forecast_template_config.py @@ -58,6 +58,8 @@ def test_multistage_forecast(): "holiday_post_num_days": 1, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "changepoints_dict": { @@ -97,6 +99,7 @@ def test_multistage_forecast(): "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -124,6 +127,8 @@ def test_multistage_forecast(): "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "changepoints_dict": None, @@ -155,6 +160,7 @@ def test_multistage_forecast(): "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -189,6 +195,8 @@ def test_multistage_forecast_silverkite_wow(): "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": True, @@ -221,6 +229,7 @@ def test_multistage_forecast_silverkite_wow(): "drop_pred_cols": None, "explicit_pred_cols": None, "regression_weight_col": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) diff --git a/greykite/tests/framework/templates/test_prophet_template.py b/greykite/tests/framework/templates/test_prophet_template.py index 28de556..49cce7a 100644 --- 
a/greykite/tests/framework/templates/test_prophet_template.py +++ b/greykite/tests/framework/templates/test_prophet_template.py @@ -5,7 +5,7 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal +from pandas.testing import assert_frame_equal import greykite.common.constants as cst from greykite.common.evaluation import EvaluationMetricEnum diff --git a/greykite/tests/framework/templates/test_silverkite_template.py b/greykite/tests/framework/templates/test_silverkite_template.py index 5fc78e3..0c2ca73 100644 --- a/greykite/tests/framework/templates/test_silverkite_template.py +++ b/greykite/tests/framework/templates/test_silverkite_template.py @@ -35,7 +35,6 @@ from greykite.framework.templates.silverkite_template import get_extra_pred_cols from greykite.framework.utils.framework_testing_utils import assert_basic_pipeline_equal from greykite.framework.utils.framework_testing_utils import check_forecast_pipeline_result -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics from greykite.sklearn.estimator.silverkite_estimator import SilverkiteEstimator @@ -45,12 +44,7 @@ def silverkite(): @pytest.fixture -def silverkite_diagnostics(): - return SilverkiteDiagnostics() - - -@pytest.fixture -def model_components_param(silverkite, silverkite_diagnostics): +def model_components_param(silverkite): return ModelComponentsParam( seasonality={ "fs_components_df": None @@ -78,7 +72,6 @@ def model_components_param(silverkite, silverkite_diagnostics): }, custom={ "silverkite": silverkite, - "silverkite_diagnostics": silverkite_diagnostics, "extra_pred_cols": [["ct1"], ["ct2"], ["regressor1", "regressor3"]], "max_admissible_value": 4, } @@ -115,7 +108,7 @@ def test_get_extra_pred_cols(): assert set(extra_pred_cols) == {"p1", "p2", "p3"} -def test_apply_default_model_components(model_components_param, silverkite, silverkite_diagnostics): +def 
test_apply_default_model_components(model_components_param, silverkite): model_components = apply_default_model_components() assert_equal(model_components.seasonality, { "fs_components_df": [pd.DataFrame({ @@ -127,6 +120,8 @@ def test_apply_default_model_components(model_components_param, silverkite, silv assert model_components.growth == {} assert model_components.events == { "daily_event_df_dict": [None], + "daily_event_neighbor_impact": [None], + "daily_event_shifted_effect": [None] } assert model_components.changepoints == { "changepoints_dict": [None], @@ -143,7 +138,6 @@ def test_apply_default_model_components(model_components_param, silverkite, silv } assert_equal(model_components.custom, { "silverkite": [SilverkiteForecast()], - "silverkite_diagnostics": [SilverkiteDiagnostics()], "origin_for_time_vars": [None], "extra_pred_cols": ["ct1"], # linear growth "drop_pred_cols": [None], @@ -155,13 +149,12 @@ def test_apply_default_model_components(model_components_param, silverkite, silv "min_admissible_value": [None], "max_admissible_value": [None], "regression_weight_col": [None], + "remove_intercept": [False], "normalize_method": [None] }, ignore_keys={ "silverkite": None, - "silverkite_diagnostics": None }) assert model_components.custom["silverkite"][0] != silverkite # a different instance was created - assert model_components.custom["silverkite_diagnostics"][0] != silverkite_diagnostics # overwrite some parameters time_properties = { @@ -175,6 +168,8 @@ def test_apply_default_model_components(model_components_param, silverkite, silv assert updated_components.seasonality == model_components_param.seasonality assert updated_components.events == { "daily_event_df_dict": [None], + "daily_event_neighbor_impact": [None], + "daily_event_shifted_effect": [None] } assert updated_components.changepoints == { "changepoints_dict": { # combination of defaults and provided params @@ -190,7 +185,6 @@ def test_apply_default_model_components(model_components_param, 
silverkite, silv assert updated_components.uncertainty == model_components_param.uncertainty assert updated_components.custom == { # combination of defaults and provided params "silverkite": silverkite, # the same object that was passed in (not a copy) - "silverkite_diagnostics": silverkite_diagnostics, "origin_for_time_vars": [time_properties["origin_for_time_vars"]], # from time_properties "extra_pred_cols": [["ct1"], ["ct2"], ["regressor1", "regressor3"]], "drop_pred_cols": [None], @@ -203,6 +197,7 @@ def test_apply_default_model_components(model_components_param, silverkite, silv "min_admissible_value": [None], "normalize_method": [None], "regression_weight_col": [None], + "remove_intercept": [False] } # `time_properties` without start_year key @@ -330,13 +325,12 @@ def test_get_lagged_regressor_info(): assert lagged_regressor_info["overall_max_lag_order"] == 21 -def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite, silverkite_diagnostics): +def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite): template = SilverkiteTemplate() template.config = template.apply_forecast_config_defaults() hyperparameter_grid = template.get_hyperparameter_grid() expected_grid = { "estimator__silverkite": [SilverkiteForecast()], - "estimator__silverkite_diagnostics": [SilverkiteDiagnostics()], "estimator__origin_for_time_vars": [None], "estimator__extra_pred_cols": [["ct1"]], "estimator__drop_pred_cols": [None], @@ -347,6 +341,8 @@ def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite, "fit_algorithm": "linear", "fit_algorithm_params": None}], "estimator__daily_event_df_dict": [None], + "estimator__daily_event_neighbor_impact": [None], + "estimator__daily_event_shifted_effect": [None], "estimator__fs_components_df": [pd.DataFrame({ "name": ["tod", "tow", "tom", "toq", "toy"], "period": [24.0, 7.0, 1.0, 1.0, 1.0], @@ -363,14 +359,14 @@ def test_get_silverkite_hyperparameter_grid(model_components_param, 
silverkite, "estimator__max_admissible_value": [None], "estimator__normalize_method": [None], "estimator__regression_weight_col": [None], + "estimator__remove_intercept": [False], "estimator__uncertainty_dict": [None], } assert_equal( hyperparameter_grid, expected_grid, - ignore_keys={"estimator__silverkite": None, "estimator__silverkite_diagnostics": None}) + ignore_keys={"estimator__silverkite": None}) assert hyperparameter_grid["estimator__silverkite"][0] != silverkite - assert hyperparameter_grid["estimator__silverkite_diagnostics"][0] != silverkite_diagnostics # Tests auto-list conversion template.config.model_components_param = model_components_param @@ -378,7 +374,6 @@ def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite, hyperparameter_grid = template.get_hyperparameter_grid() expected_grid = { "estimator__silverkite": [silverkite], - "estimator__silverkite_diagnostics": [silverkite_diagnostics], "estimator__origin_for_time_vars": [2020], "estimator__extra_pred_cols": [["ct1"], ["ct2"], ["regressor1", "regressor3"]], "estimator__drop_pred_cols": [None], @@ -390,6 +385,8 @@ def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite, "fit_algorithm_params": None, }], "estimator__daily_event_df_dict": [None], + "estimator__daily_event_neighbor_impact": [None], + "estimator__daily_event_shifted_effect": [None], "estimator__fs_components_df": [None], "estimator__autoreg_dict": [None], "estimator__simulation_num": [10], @@ -405,6 +402,7 @@ def test_get_silverkite_hyperparameter_grid(model_components_param, silverkite, "estimator__max_admissible_value": [4], "estimator__normalize_method": [None], "estimator__regression_weight_col": [None], + "estimator__remove_intercept": [False], "estimator__uncertainty_dict": [{ "uncertainty_method": "simple_conditional_residuals" }], @@ -757,10 +755,10 @@ def test_run_template_2(): q80 = EvaluationMetricEnum.Quantile80.get_metric_name() assert result.backtest.test_evaluation[rmse] 
== pytest.approx(2.692, rel=1e-2) assert result.backtest.test_evaluation[q80] == pytest.approx(1.531, rel=1e-2) - assert result.backtest.test_evaluation[PREDICTION_BAND_COVERAGE] == pytest.approx(0.823, rel=1e-2) + assert result.backtest.test_evaluation[PREDICTION_BAND_COVERAGE] == pytest.approx(0.835, rel=1e-2) assert result.forecast.train_evaluation[rmse] == pytest.approx(2.304, rel=1e-2) assert result.forecast.train_evaluation[q80] == pytest.approx(0.921, rel=1e-2) - assert result.forecast.train_evaluation[PREDICTION_BAND_COVERAGE] == pytest.approx(0.897, rel=1e-2) + assert result.forecast.train_evaluation[PREDICTION_BAND_COVERAGE] == pytest.approx(0.910, rel=1e-2) check_forecast_pipeline_result( result, coverage=0.9, @@ -842,7 +840,7 @@ def test_run_template_4(): config=config, ) rmse = EvaluationMetricEnum.RootMeanSquaredError.get_metric_name() - assert result.backtest.test_evaluation[rmse] == pytest.approx(4.95, rel=1e-1) + assert result.backtest.test_evaluation[rmse] == pytest.approx(4.40, rel=1e-1) check_forecast_pipeline_result( result, coverage=0.9, @@ -917,8 +915,12 @@ def test_run_template_6(): autoregression=dict(autoreg_dict=dict(lag_dict=dict(orders=[1]))), lagged_regressors={ "lagged_regressor_dict": [ - {"regressor2": "auto"}, - {"regressor_categ": {"lag_dict": {"orders": [5]}}} + { + "regressor_categ": {"lag_dict": {"orders": [1, 5]}} + }, + { + "regressor_categ": {"lag_dict": {"orders": [5]}} + } ]}, uncertainty=dict(uncertainty_dict=None)) config = ForecastConfig( @@ -935,7 +937,7 @@ def test_run_template_6(): config=config, ) rmse = EvaluationMetricEnum.RootMeanSquaredError.get_metric_name() - assert result.backtest.test_evaluation[rmse] == pytest.approx(4.46, rel=1e-1) + assert result.backtest.test_evaluation[rmse] == pytest.approx(5.44, rel=1e-1) check_forecast_pipeline_result( result, coverage=0.9, @@ -943,6 +945,10 @@ def test_run_template_6(): score_func=EvaluationMetricEnum.MeanAbsolutePercentError.name, greater_is_better=False) # Checks 
lagged regressor columns + # Both models include lagged regressors, since a CV was run, + # the best model picked turns out to be the one with + # `"regressor_categ": {"lag_dict": {"orders": [1, 5]}`. + # But either model must contain `lag5`, we verify its existence. actual_pred_cols = set(result.model[-1].model_dict["pred_cols"]) actual_x_mat_cols = set(result.model[-1].model_dict["x_mat"].columns) expected_pred_cols = { diff --git a/greykite/tests/framework/templates/test_simple_silverkite_template.py b/greykite/tests/framework/templates/test_simple_silverkite_template.py index 8307595..c2d530c 100644 --- a/greykite/tests/framework/templates/test_simple_silverkite_template.py +++ b/greykite/tests/framework/templates/test_simple_silverkite_template.py @@ -1,11 +1,13 @@ import dataclasses import datetime import warnings +from datetime import timedelta from typing import Type import numpy as np import pandas as pd import pytest +from sklearn.exceptions import ConvergenceWarning from testfixtures import LogCapture from greykite.algo.forecast.silverkite.constants.silverkite_constant import SilverkiteConstant @@ -41,7 +43,6 @@ from greykite.framework.utils.framework_testing_utils import assert_basic_pipeline_equal from greykite.framework.utils.framework_testing_utils import check_forecast_pipeline_result from greykite.framework.utils.result_summary import summarize_grid_search_results -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics from greykite.sklearn.estimator.simple_silverkite_estimator import SimpleSilverkiteEstimator @@ -51,8 +52,44 @@ def silverkite(): @pytest.fixture -def silverkite_diagnostics(): - return SilverkiteDiagnostics() +def forecast_data(): + """Loads and prepares some real and simulated data sets for performance testing.""" + dl = DataLoader() + df_pt = dl.load_peyton_manning() + # This adds a small number to avoid very small observed values in MAPE denominator + df_pt["y"] = df_pt["y"] + 1 + + agg_func = 
{"count": "sum", "tmin": "mean", "tmax": "mean", "pn": "mean"} + df_bk = dl.load_bikesharing(agg_freq="daily", agg_func=agg_func) + # This adds a small number to avoid zeros in MAPE calculation + df_bk["count"] += 10 + # Drops last value as data might be incorrect since the original data is hourly + df_bk.drop(df_bk.tail(1).index, inplace=True) + df_bk.reset_index(drop=True, inplace=True) + df_bk = df_bk[["ts", "count"]] + df_bk.columns = ["ts", "y"] + + autoreg_coefs = [0.5] * 15 + periods = 600 + np.random.seed(1317) + data = generate_df_for_tests( + freq="D", + periods=periods + len(autoreg_coefs), # Generates more data to avoid missing at the end + train_frac=0.8, + train_end_date=None, + noise_std=0.5, + remove_extra_cols=True, + autoreg_coefs=autoreg_coefs, + fs_coefs=[0.1, 1, 0.1], + growth_coef=2.0, + intercept=10.0) + + df_sim = data["df"][:periods] # Removes last few values which are Nan due to using of lags in the data generation + df_dict = {} + df_dict["daily_pt"] = df_pt[:periods] + df_dict["daily_bk"] = df_bk[:periods] + df_dict["sim"] = df_sim + return df_dict class MySilverkiteHoliday(SilverkiteHoliday): @@ -258,6 +295,8 @@ def test_get_single_model_components_param_from_template(): "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -300,6 +339,7 @@ def test_get_single_model_components_param_from_template(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -325,6 +365,8 @@ def test_get_single_model_components_param_from_template(): "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -367,6 +409,7 @@ def 
test_get_single_model_components_param_from_template(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -391,6 +434,8 @@ def test_get_single_model_components_param_from_template(): "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -433,6 +478,7 @@ def test_get_single_model_components_param_from_template(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -457,6 +503,8 @@ def test_get_single_model_components_param_from_template(): "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -499,6 +547,7 @@ def test_get_single_model_components_param_from_template(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) @@ -524,6 +573,8 @@ def test_get_single_model_components_param_from_template(): "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, changepoints={ "auto_growth": False, @@ -558,12 +609,13 @@ def test_get_single_model_components_param_from_template(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } ) -def test_get_model_components_from_model_template(silverkite, silverkite_diagnostics): +def test_get_model_components_from_model_template(silverkite): """Tests get_model_components_from_model_template and 
get_model_components_and_override_from_model_template.""" sst = SimpleSilverkiteTemplate() model_components = sst.get_model_components_from_model_template("SILVERKITE")[0] @@ -586,6 +638,8 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos "holiday_post_num_days": 2, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None } assert model_components.changepoints == { "auto_growth": False, @@ -628,6 +682,7 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } assert model_components.hyperparameter_override is None @@ -695,7 +750,6 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos hyperparameter_override={ "input__response__null__max_frac": 0.1, "estimator__silverkite": silverkite, - "estimator__silverkite_diagnostics": silverkite_diagnostics, } ) original_components = dataclasses.replace(model_components) # creates a copy @@ -721,7 +775,9 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos "holiday_pre_num_days": 3, "holiday_post_num_days": 2, "holiday_pre_post_num_dict": {"New Year's Day": (7, 3)}, - "daily_event_df_dict": daily_event_df_dict + "daily_event_df_dict": daily_event_df_dict, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }) assert updated_components.changepoints == { "auto_growth": False, @@ -765,12 +821,12 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" } assert updated_components.hyperparameter_override == { "input__response__null__max_frac": 0.1, "estimator__silverkite": 
silverkite, - "estimator__silverkite_diagnostics": silverkite_diagnostics } # test change point features model_components = ModelComponentsParam( @@ -796,7 +852,7 @@ def test_get_model_components_from_model_template(silverkite, silverkite_diagnos } -def test_override_model_components(silverkite, silverkite_diagnostics): +def test_override_model_components(silverkite): sst = SimpleSilverkiteTemplate() default_model_components = ModelComponentsParam( seasonality={ @@ -869,7 +925,6 @@ def test_override_model_components(silverkite, silverkite_diagnostics): }, hyperparameter_override={ "estimator__silverkite": silverkite, - "estimator__silverkite_diagnostics": silverkite_diagnostics, "estimator__daily_seasonality": 10 }) new_model_components = sst._SimpleSilverkiteTemplate__override_model_components( @@ -936,7 +991,6 @@ def test_override_model_components(silverkite, silverkite_diagnostics): }, hyperparameter_override={ "estimator__silverkite": silverkite, - "estimator__silverkite_diagnostics": silverkite_diagnostics, "estimator__daily_seasonality": 10 }) # Test None model component. 
@@ -1047,6 +1101,8 @@ def test_get_model_components_and_override_from_model_template_single(): "holiday_post_num_days": 0, "holiday_pre_post_num_dict": None, "daily_event_df_dict": None, + "daily_event_neighbor_impact": None, + "daily_event_shifted_effect": None }, custom={ "feature_sets_enabled": "auto", @@ -1062,6 +1118,7 @@ def test_get_model_components_and_override_from_model_template_single(): "regression_weight_col": None, "min_admissible_value": None, "max_admissible_value": None, + "remove_intercept": False, "normalize_method": "zero_to_one" }, autoregression={ @@ -1281,6 +1338,8 @@ def test_apply_default_model_components_daily_1(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1303,6 +1362,7 @@ def test_apply_default_model_components_daily_1(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None], estimator__time_properties=[None], estimator__origin_for_time_vars=[None], @@ -1339,6 +1399,8 @@ def test_apply_default_model_components_daily_1(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1361,6 +1423,7 @@ def test_apply_default_model_components_daily_1(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None], estimator__time_properties=[None], estimator__origin_for_time_vars=[None], @@ -1397,6 +1460,8 @@ def 
test_apply_default_model_components_daily_1(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1419,6 +1484,7 @@ def test_apply_default_model_components_daily_1(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None], estimator__time_properties=[None], estimator__origin_for_time_vars=[None], @@ -1470,6 +1536,8 @@ def test_apply_default_model_components_daily_90(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1492,6 +1560,7 @@ def test_apply_default_model_components_daily_90(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 2 @@ -1522,6 +1591,8 @@ def test_apply_default_model_components_daily_90(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1544,6 +1615,7 @@ def test_apply_default_model_components_daily_90(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 3 @@ -1582,6 +1654,8 @@ def 
test_apply_default_model_components_daily_90(): estimator__holiday_post_num_days=[2], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1604,6 +1678,7 @@ def test_apply_default_model_components_daily_90(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 4 @@ -1642,6 +1717,8 @@ def test_apply_default_model_components_daily_90(): estimator__holiday_post_num_days=[4], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=["auto"], # Fit algorithm @@ -1664,6 +1741,7 @@ def test_apply_default_model_components_daily_90(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ) ] @@ -1703,6 +1781,8 @@ def test_apply_default_model_components_weekly(): estimator__holiday_post_num_days=[0], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=[False], # Fit algorithm @@ -1725,6 +1805,7 @@ def test_apply_default_model_components_weekly(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 2 @@ -1763,6 +1844,8 @@ def test_apply_default_model_components_weekly(): estimator__holiday_post_num_days=[0], 
estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=[False], # Fit algorithm @@ -1785,6 +1868,7 @@ def test_apply_default_model_components_weekly(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 3 @@ -1823,6 +1907,8 @@ def test_apply_default_model_components_weekly(): estimator__holiday_post_num_days=[0], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=[False], # Fit algorithm @@ -1845,6 +1931,7 @@ def test_apply_default_model_components_weekly(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ), # Config 4 @@ -1883,6 +1970,8 @@ def test_apply_default_model_components_weekly(): estimator__holiday_post_num_days=[0], estimator__holiday_pre_post_num_dict=[None], estimator__daily_event_df_dict=[None], + estimator__daily_event_neighbor_impact=[None], + estimator__daily_event_shifted_effect=[None], # Feature sets estimator__feature_sets_enabled=[False], # Fit algorithm @@ -1905,6 +1994,7 @@ def test_apply_default_model_components_weekly(): estimator__fast_simulation=[False], estimator__regressor_cols=[[]], estimator__lagged_regressor_dict=[None], + estimator__remove_intercept=[False], estimator__uncertainty_dict=[None] ) ] @@ -1923,11 +2013,10 @@ def test_apply_default_model_template_hourly(): template.get_hyperparameter_grid() -def test_get_simple_silverkite_hyperparameter_grid(silverkite, silverkite_diagnostics): +def 
test_get_simple_silverkite_hyperparameter_grid(silverkite): """Tests get_silverkite_hyperparameter_grid""" # tests default values, unpacking, conversion to list silverkite = SimpleSilverkiteForecast() - silverkite_diagnostics = SilverkiteDiagnostics() template = SimpleSilverkiteTemplate() template.config = template.apply_forecast_config_defaults() hyperparameter_grid = template.get_hyperparameter_grid() @@ -1944,6 +2033,8 @@ def test_get_simple_silverkite_hyperparameter_grid(silverkite, silverkite_diagno "estimator__holiday_post_num_days": [2], "estimator__holiday_pre_post_num_dict": [None], "estimator__daily_event_df_dict": [None], + "estimator__daily_event_neighbor_impact": [None], + "estimator__daily_event_shifted_effect": [None], "estimator__auto_growth": [False], "estimator__changepoints_dict": [{ "method": "auto", @@ -1977,7 +2068,8 @@ def test_get_simple_silverkite_hyperparameter_grid(silverkite, silverkite_diagno "estimator__extra_pred_cols": [[]], "estimator__drop_pred_cols": [None], "estimator__explicit_pred_cols": [None], - "estimator__regression_weight_col": [None] + "estimator__regression_weight_col": [None], + "estimator__remove_intercept": [False] } assert_equal(hyperparameter_grid, expected_grid) @@ -2009,7 +2101,6 @@ def test_get_simple_silverkite_hyperparameter_grid(silverkite, silverkite_diagno hyperparameter_override={ "input__response__null__max_frac": 0.1, "estimator__silverkite": silverkite, - "estimator__silverkite_diagnostics": silverkite_diagnostics, "estimator__growth_term": ["override_estimator__growth_term"], "estimator__extra_pred_cols": ["override_estimator__extra_pred_cols"] } @@ -2035,7 +2126,6 @@ def test_get_simple_silverkite_hyperparameter_grid(silverkite, silverkite_diagno updated_grid["estimator__weekly_seasonality"] = [False] updated_grid["input__response__null__max_frac"] = [0.1] updated_grid["estimator__silverkite"] = [silverkite] - updated_grid["estimator__silverkite_diagnostics"] = [silverkite_diagnostics] 
updated_grid["estimator__growth_term"] = ["override_estimator__growth_term"] updated_grid["estimator__regressor_cols"] = [["reg1", "reg2"]] updated_grid["estimator__extra_pred_cols"] = [["override_estimator__extra_pred_cols"]] @@ -2542,7 +2632,7 @@ def test_run_template_3(): score_func=metric.name, greater_is_better=False) - # Note that for newer scikit-learn version, needs to add a check for ValueError, matching "model is misconfigured" + # Note that for newer scikit-learn (1.1+), we need to add a check for ValueError, matching "model is misconfigured" with pytest.raises((ValueError, KeyError)) as exception_info, pytest.warns( UserWarning, match="Removing the columns from the input list of 'regressor_cols'" @@ -2626,7 +2716,7 @@ def test_run_template_4(): assert all(param in list(grid_results["params"]) for param in expected_params) assert result.grid_search.best_index_ == 2 assert result.backtest.test_evaluation[rmse] == pytest.approx(5.425, rel=1e-2) - assert result.backtest.test_evaluation[q80] == pytest.approx(1.048, rel=1e-2) + assert result.backtest.test_evaluation[q80] == pytest.approx(1.036, rel=1e-2) assert result.forecast.train_evaluation[rmse] == pytest.approx(2.526, rel=1e-2) assert result.forecast.train_evaluation[q80] == pytest.approx(0.991, rel=1e-2) check_forecast_pipeline_result( @@ -2886,7 +2976,7 @@ def test_run_template_8(): config=config, ) rmse = EvaluationMetricEnum.RootMeanSquaredError.get_metric_name() - assert result.backtest.test_evaluation[rmse] == pytest.approx(6.691, rel=1e-1) + assert result.backtest.test_evaluation[rmse] == pytest.approx(6.123, rel=1e-1) check_forecast_pipeline_result( result, coverage=0.9, @@ -2981,7 +3071,7 @@ def test_run_template_9(): rmse = EvaluationMetricEnum.RootMeanSquaredError.get_metric_name() q80 = EvaluationMetricEnum.Quantile80.get_metric_name() assert result.backtest.test_evaluation[rmse] == pytest.approx(3.360, rel=1e-2) - assert result.backtest.test_evaluation[q80] == pytest.approx(1.139, rel=1e-2) 
+ assert result.backtest.test_evaluation[q80] == pytest.approx(1.124, rel=1e-2) assert result.forecast.train_evaluation[rmse] == pytest.approx(2.069, rel=1e-2) assert result.forecast.train_evaluation[q80] == pytest.approx(0.771, rel=1e-2) check_forecast_pipeline_result( @@ -3013,7 +3103,7 @@ def test_run_template_9(): assert expected_pred_cols.issubset(actual_pred_cols) assert expected_x_mat_cols.issubset(actual_x_mat_cols) - # Note that for newer scikit-learn version, needs to add a check for ValueError, matching "model is misconfigured" + # Note that for newer scikit-learn (1.1+), we need to add a check for ValueError, matching "model is misconfigured" with pytest.raises((ValueError, KeyError)) as exception_info: model_components = ModelComponentsParam( regressors={ @@ -3029,7 +3119,7 @@ def test_run_template_9(): info_str = str(exception_info.value) assert "missing_regressor" in info_str or "model is misconfigured" in info_str - # Note that for newer scikit-learn version, needs to add a check for ValueError, matching "model is misconfigured" + # Note that for newer scikit-learn (1.1+), we need to add a check for ValueError, matching "model is misconfigured" with pytest.raises((ValueError, KeyError)) as exception_info: model_components = ModelComponentsParam( lagged_regressors={ @@ -3047,7 +3137,7 @@ def test_run_template_9(): info_str = str(exception_info.value) assert "missing_lagged_regressor" in info_str or "model is misconfigured" in info_str - # Note that for newer scikit-learn version, needs to add a check for ValueError, matching "model is misconfigured" + # Note that for newer scikit-learn (1.1+), we need to add a check for ValueError, matching "model is misconfigured" with pytest.raises((ValueError, KeyError)) as exception_info: model_components = ModelComponentsParam( lagged_regressors={ @@ -3067,6 +3157,137 @@ def test_run_template_9(): assert "missing_lagged_regressor" in info_str or "model is misconfigured" in info_str +def 
test_run_template_with_event_indicators(): + """Tests event indicators (is_event, is_event_exact, is_event_adjacent) and + its interactions with daily data.""" + data = generate_df_with_reg_for_tests( + freq="D", + periods=365*3, + remove_extra_cols=True, + mask_test_actuals=True + ) + df = data["df"] # non-NA values from 2018-07-01 to 2020-11-23 + time_col = "some_time_col" + value_col = "some_value_col" + df.rename({ + TIME_COL: time_col, + VALUE_COL: value_col + }, axis=1, inplace=True) + + metadata = MetadataParam( + time_col=time_col, + value_col=value_col, + freq="D", + date_format="%Y-%m-%d", + train_end_date=datetime.datetime(2020, 9, 7) # 2020-09-07 is Labor Day (Monday) + ) + evaluation_metric = EvaluationMetricParam( + cv_selection_metric=EvaluationMetricEnum.MedianAbsolutePercentError.name, + cv_report_metrics=None, + ) + evaluation_period = EvaluationPeriodParam( + test_horizon=1, + periods_between_train_test=0, + cv_max_splits=0 + ) + + model_components = ModelComponentsParam( + seasonality={ + "yearly_seasonality": True, + "weekly_seasonality": False + }, + growth={ + "growth_term": "quadratic" + }, + events={ + "holidays_to_model_separately": "ALL_HOLIDAYS_IN_COUNTRIES", + "holiday_lookup_countries": ["UnitedStates"], + "holiday_pre_num_days": 1, + }, + changepoints={ + "changepoints_dict": { + "method": "uniform", + "n_changepoints": 20, + } + }, + autoregression={ + "autoreg_dict": "auto" + }, + regressors={ + "regressor_cols": [ + "regressor1", + "regressor2", + "regressor3", + "regressor_bool", + "regressor_categ" + ] + }, + lagged_regressors=None, + uncertainty={ + "uncertainty_dict": "auto", + }, + hyperparameter_override=None, + custom={ + "fit_algorithm_dict": { + "fit_algorithm": "ridge" + }, + "feature_sets_enabled": "auto", + "extra_pred_cols": ["is_event:is_weekend", "is_event_adjacent", "is_event_exact"] + } + ) + + forecast_horizon = 1 + coverage = 0.9 + config = ForecastConfig( + model_template=ModelTemplateEnum.SILVERKITE.name, + 
metadata_param=metadata, + forecast_horizon=forecast_horizon, + coverage=coverage, + evaluation_metric_param=evaluation_metric, + evaluation_period_param=evaluation_period, + model_components_param=model_components, + ) + + forecaster = Forecaster() + result = forecaster.run_forecast_config(df=df, config=config) + + # Checks event indicators are correctly modeled + model = result.model[-1] + x_mat = model.model_dict["x_mat"] + expected_event_cols = ["is_event_exact", "is_event_adjacent", "is_event:is_weekend[True]", "is_event:is_weekend[False]"] + actual_event_cols = [] + for col in x_mat.columns: + if "is_event" in col: + actual_event_cols.append(col) + assert set(expected_event_cols) == set(actual_event_cols) + # 2020-09-07 is Labor Day and on Monday + assert np.array_equal(x_mat[expected_event_cols].iloc[-1].values, [1, 0, 0, 1]) + assert np.array_equal(x_mat[expected_event_cols].iloc[-2].values, [0, 1, 1, 0]) + + +def test_run_template_with_infinite_values(): + """Tests template with null and infinite values in data""" + data = generate_df_for_tests( + freq="D", + periods=400) + df = data["train_df"] + # Introduces null and infinity values + df[VALUE_COL].iloc[100:110] = np.nan + df[VALUE_COL].iloc[150:160] = np.inf + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + result = Forecaster().run_forecast_config( + df=df, + ) + check_forecast_pipeline_result( + result, + coverage=None, + strategy=None, + score_func=EvaluationMetricEnum.MeanAbsolutePercentError.name, + greater_is_better=False) + + def test_run_template_daily_1(): dl = DataLoader() data = dl.load_peyton_manning() @@ -3408,5 +3629,121 @@ def test_silverkite_auto_config(): "order": [3, 1, 1, 6], "seas_names": ["weekly", "monthly", "quarterly", "yearly"] })) - assert len(result.model[-1].model_dict["daily_event_df_dict"]) == 194 + assert len(result.model[-1].model_dict["daily_event_df_dict"]) == 199 assert "ct1" in result.model[-1].model_dict["x_mat"].columns + + +def 
test_holiday_neighbor_impact(): + """Tests holidays in weekly data.""" + df = pd.DataFrame({ + "ts": pd.date_range("2020-01-01", freq="W-SUN", periods=100), + "y": np.random.randn(100) + }) + config = ForecastConfig( + model_template="SILVERKITE", + forecast_horizon=1, + metadata_param=MetadataParam( + freq="W-SUN" + ), + evaluation_period_param=EvaluationPeriodParam( + test_horizon=0, + cv_horizon=0 + ), + model_components_param=ModelComponentsParam( + events=dict( + daily_event_neighbor_impact=lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + + timedelta(days=i) for i in range(7)] + ), + ) + ) + result = Forecaster().run_forecast_config( + df=df, + config=config, + ) + # The first day is "2020-01-05". + # Because we set the ``daily_event_neighbor_impact`` parameter, + # it is marked as New Year's day. + assert result.model[-1].model_dict["x_mat"][ + "C(Q('events_New Years Day'), levels=['', 'event'])[T.event]"].iloc[0] == 1 + + +def test_various_models(forecast_data): + """This is a light-weight performance check (in terms of forecast accuracy) + on the user-facing forecasting layer. + This is done across a few algorithms and data sets (two real data sets and one simulated data set). + This is not intended to replace comprehensive benchmarking; rather, it is there to + capture major code changes that might impact the final results. + """ + # Specify dataset information + metadata = MetadataParam( + time_col="ts", + value_col="y", + freq="D") + + def fit_forecast( + df_label, + fit_algorithm): + """Fits a forecast and calculates cross validation errors for a given + data set and ``fit_algorithm``. + It returns a tuple: (Test MAPE, Feature number) + """ + cv_min_train_periods = 500 + # Let CV use most recent splits for cross-validation. + cv_use_most_recent_splits = True + # Determine the maximum number of validations. 
+ cv_max_splits = 2 + forecast_horizon = 14 + evaluation_period_param = EvaluationPeriodParam( + test_horizon=forecast_horizon, + cv_horizon=forecast_horizon, + periods_between_train_test=0, + cv_min_train_periods=cv_min_train_periods, + cv_expanding_window=True, + cv_use_most_recent_splits=cv_use_most_recent_splits, + cv_periods_between_splits=13, + cv_periods_between_train_test=0, + cv_max_splits=cv_max_splits, + ) + + model_components = ModelComponentsParam( + custom=dict( + fit_algorithm_dict=dict( + fit_algorithm=fit_algorithm))) + + forecaster = Forecaster() + result = forecaster.run_forecast_config( + df=forecast_data[df_label], + config=ForecastConfig( + model_template=ModelTemplateEnum.SILVERKITE.name, + forecast_horizon=forecast_horizon, + coverage=0.95, + evaluation_period_param=evaluation_period_param, + metadata_param=metadata, + model_components_param=model_components)) + + grid_search = result.grid_search + cv_results = summarize_grid_search_results( + grid_search=grid_search, + decimals=2, + cv_report_metrics=None) + test_mape = cv_results["mean_test_MAPE"].values[0] + features_num = len(result.model[-1].model_dict["pred_cols"]) + + return (test_mape, features_num) + + err_feature_num_dict = {} + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=ConvergenceWarning) + for df_label in forecast_data.keys(): + for fit_algorithm in ["ridge", "elastic_net"]: + err_feature_num_dict[f"{df_label}_{fit_algorithm}"] = fit_forecast( + df_label, + fit_algorithm) + + assert err_feature_num_dict == { + "daily_pt_ridge": (1.8, 131), + "daily_pt_elastic_net": (1.84, 131), + "daily_bk_ridge": (37.2, 124), + "daily_bk_elastic_net": (34.41, 124), + "sim_ridge": (1.66, 131), + "sim_elastic_net": (1.42, 131)} diff --git a/greykite/tests/sklearn/estimator/test_auto_arima_estimator.py b/greykite/tests/sklearn/estimator/test_auto_arima_estimator.py index a548cf5..9af9e80 100644 --- a/greykite/tests/sklearn/estimator/test_auto_arima_estimator.py 
+++ b/greykite/tests/sklearn/estimator/test_auto_arima_estimator.py @@ -199,7 +199,7 @@ def test_score_function(daily_data): time_col = "ts" model.fit(train_df, time_col=time_col, value_col=value_col) score = model.score(daily_data["test_df"], daily_data["test_df"][value_col]) - assert score < 8.0 + assert score < 9.0 def test_summary(daily_data): diff --git a/greykite/tests/sklearn/estimator/test_base_silverkite_estimator.py b/greykite/tests/sklearn/estimator/test_base_silverkite_estimator.py index 02387b6..e3cfe27 100644 --- a/greykite/tests/sklearn/estimator/test_base_silverkite_estimator.py +++ b/greykite/tests/sklearn/estimator/test_base_silverkite_estimator.py @@ -17,6 +17,7 @@ from greykite.common.testing_utils import generate_df_for_tests from greykite.common.testing_utils import generate_df_with_reg_for_tests from greykite.sklearn.estimator.base_silverkite_estimator import BaseSilverkiteEstimator +from greykite.sklearn.estimator.testing_utils import params_component_breakdowns from greykite.sklearn.estimator.testing_utils import params_components @@ -88,6 +89,65 @@ def df_pt(): return dl.load_peyton_manning() +@pytest.fixture +def expected_component_names(): + return params_component_breakdowns()["expected_component_names"] + + +@pytest.fixture(scope="module") +def test_params(): + time_col = "ts" + # value_col name is chosen such that it contains keywords "ct" and "sin" + # so that we can test patterns specified for each component work correctly + value_col = "basin_impact" + return dict( + time_col=time_col, + value_col=value_col, + df=pd.DataFrame({ + time_col: [ + datetime.datetime(2018, 1, 1), + datetime.datetime(2018, 1, 2), + datetime.datetime(2018, 1, 3), + datetime.datetime(2018, 1, 4), + datetime.datetime(2018, 1, 5)], + value_col: [10, 10, 10, 10, 10], + "dummy_col": [0, 0, 0, 0, 0], + }), + feature_df=pd.DataFrame({ + # Trend columns: growth, changepoints and interactions (total 5 columns) + "ct1": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), + 
"ct1:tod": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), + "ct_sqrt": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), + "changepoint0_2018_01_02_00": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), + "changepoint1_2018_01_04_00": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), + # Lag columns: autoregression, lagged regressor + f"{value_col}_lag1": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), + f"{value_col}_avglag_3_5": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), + "regressor1_lag1": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), + "regressor_categ_avglag_4_6": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), + # Daily seasonality with interaction (total 4 columns) + "sin1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), + "cos1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), + "is_weekend[T.True]:sin1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), + "is_weekend[T.True]:cos1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), + # Yearly seasonality (total 6 columns) + "sin1_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + "cos1_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + "sin2_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + "cos2_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + "sin3_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + "cos3_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), + # Holiday with pre and post effect (1 at the where the date and event match) + # e.g. 
New Years Day is 1 at 1st January, 0 rest of the days + "Q('events_New Years Day')[T.event]": np.array([1.0, 0.0, 0.0, 0.0, 0.0]), + "Q('events_New Years Day_minus_1')[T.event]": np.array([0.0, 0.0, 0.0, 0.0, 0.0]), + "Q('events_New Years Day_minus_2')[T.event]": np.array([0.0, 0.0, 0.0, 0.0, 0.0]), + "Q('events_New Years Day_plus_1')[T.event]": np.array([0.0, 1.0, 0.0, 0.0, 0.0]), + "Q('events_New Years Day_plus_2')[T.event]": np.array([0.0, 0.0, 1.0, 0.0, 0.0]), + }) + ) + + def test_init(params): """Checks if parameters are passed to BaseSilverkiteEstimator correctly""" coverage = 0.95 @@ -382,13 +442,11 @@ def test_summary(daily_data): model.summary() -def test_silverkite_with_components_daily_data(): - """Tests get_components, plot_components, plot_trend, - plot_seasonalities with daily data and missing input values. - """ +def test_silverkite_with_components_daily_data(expected_component_names): + """Tests ``plot_components`` with daily data and missing input values.""" daily_data = generate_df_with_reg_for_tests( freq="D", - periods=20, + periods=35, train_start_date=datetime.datetime(2018, 1, 1), conti_year_origin=2018) train_df = daily_data["train_df"].copy() @@ -415,7 +473,7 @@ def test_silverkite_with_components_daily_data(): with pytest.warns(Warning): # suppress warnings from conf_interval.py and sklearn - # a subclass's fit() method will have these steps + # a subclass's `fit()` method will have these steps model.fit( X=train_df, time_col=cst.TIME_COL, @@ -428,71 +486,61 @@ def test_silverkite_with_components_daily_data(): **params_daily) model.finish_fit() - # Tests plot_components - with pytest.warns(Warning) as record: - title = "Custom component plot" - model._set_silverkite_diagnostics_params() - fig = model.plot_components(names=["trend", "YEARLY_SEASONALITY", "DUMMY"], title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows + 1 # includes changepoints - assert [fig.data[i].name for i in range(expected_rows)] == \ - 
[cst.VALUE_COL, "trend", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." in record[0].message.args[0] - - # Missing component error + # Tests `plot_components` + title = "Custom component plot" + fig = model.plot_components(title=title) + expected_rows = 10 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == title + assert fig.layout.title["x"] == 0.5 + with pytest.raises( ValueError, - match="None of the provided components have been specified in the model."): - model.plot_components(names=["DUMMY"]) - - # Tests plot_trend - title = "Custom trend plot" - fig = model.plot_trend(title=title) - expected_rows = 2 - assert len(fig.data) == expected_rows + 1 # includes changepoints - assert [fig.data[i].name for i in range(expected_rows)] == [cst.VALUE_COL, "trend"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" + match="Call the predict"): + model.plot_components(predict_phase=True) + + # Tests `plot_components(predict_phase=True)` + _ = model.predict(X=daily_data['test_df']) + fig = model.plot_components(predict_phase=True) + expected_rows = 6 + assert len(fig.data) == expected_rows # includes changepoints 
+ assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == "Component Plot - Predicted" + assert fig.layout.title["x"] == 0.5 - assert fig.layout.title["text"] == title + # Tests `plot_components(predict_phase=True)` with gap + _ = model.predict(X=daily_data['test_df'].iloc[4:, :]) + fig = model.plot_components(predict_phase=True) + expected_rows = 6 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == "Component Plot - Predicted" assert fig.layout.title["x"] == 0.5 - # Tests plot_seasonalities - with pytest.warns(Warning): - # suppresses the warning on seasonalities removed - title = "Custom seasonality plot" - fig = model.plot_seasonalities(title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "WEEKLY_SEASONALITY", "YEARLY_SEASONALITY"] + # Tests inputs to `plot_components` + with pytest.raises( + TypeError, + match="center_components must be bool: True/False"): + model.plot_components(center_components="Yes please") - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == "Day of week" - assert fig.layout.xaxis3.title["text"] == "Time of year" + with pytest.raises( + ValueError, + match="Choose denominator from"): + model.plot_components(denominator="Abs_Y") - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "weekly" - assert fig.layout.yaxis3.title["text"] == "yearly" + with pytest.raises( + TypeError, + match="grouping_regex_patterns_dict must be"): + model.plot_components(grouping_regex_patterns_dict=[]) - assert fig.layout.title["text"] == title - assert 
fig.layout.title["x"] == 0.5 + with pytest.raises( + ValueError, + match="grouping_regex_patterns_dict must be"): + model.plot_components(grouping_regex_patterns_dict={}) # Component plot error if `fit_algorithm` is "rf" or "gradient_boosting" params_daily["fit_algorithm"] = "rf" @@ -501,7 +549,7 @@ def test_silverkite_with_components_daily_data(): uncertainty_dict=params_daily["uncertainty_dict"]) with pytest.warns(Warning): # suppress warnings from conf_interval.py and sklearn - # a subclass's fit() method will have these steps + # a subclass's `fit()` method will have these steps model.fit( X=train_df, time_col=cst.TIME_COL, @@ -518,28 +566,19 @@ def test_silverkite_with_components_daily_data(): match="Component plot has only been implemented for additive linear models."): model.plot_components() - with pytest.raises( - NotImplementedError, - match="Component plot has only been implemented for additive linear models."): - model.plot_trend() - - with pytest.raises( - NotImplementedError, - match="Component plot has only been implemented for additive linear models."): - model.plot_seasonalities() - -def test_silverkite_with_components_hourly_data(): - """Tests get_components, plot_components, plot_trend, - plot_seasonalities with hourly data - """ +def test_silverkite_with_components_hourly_data(expected_component_names): + """Tests ``plot_components`` with hourly data""" hourly_data = generate_df_with_reg_for_tests( freq="H", - periods=24 * 4, + periods=24 * 3, train_start_date=datetime.datetime(2018, 1, 1), conti_year_origin=2018) train_df = hourly_data.get("train_df").copy() params_hourly = params_components() + # Removes predictor column for dow by hour interaction + # "dow_hr" is unique in forecast period otherwise and causes an error + params_hourly["extra_pred_cols"].remove('dow_hr') # converts into parameters for `forecast_silverkite` coverage = params_hourly.pop("coverage") @@ -558,66 +597,35 @@ def test_silverkite_with_components_hourly_data(): 
**params_hourly) model.finish_fit() - # Test plot_components - with pytest.warns(Warning) as record: - title = "Custom component plot" - fig = model.plot_components(names=["trend", "DAILY_SEASONALITY", "DUMMY"], title=title) - expected_rows = 3 + 1 # includes changepoints - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "trend", "DAILY_SEASONALITY", "trend change point"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - assert fig.layout.xaxis3.title["text"] == "Hour of day" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == "daily" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." in record[0].message.args[0] - - # Test plot_trend - title = "Custom trend plot" - fig = model.plot_trend(title=title) - expected_rows = 2 - assert len(fig.data) == expected_rows + 1 # includes changepoints - assert [fig.data[i].name for i in range(expected_rows)] == [cst.VALUE_COL, "trend"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - + # Tests `plot_components` + title = "Custom component plot" + fig = model.plot_components(title=title) + expected_rows = 10 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" assert fig.layout.title["text"] == title assert fig.layout.title["x"] == 0.5 - # Test plot_seasonalities - with pytest.warns(Warning): - # suppresses the warning on 
seasonalities removed - title = "Custom seasonality plot" - fig = model.plot_seasonalities(title=title) - expected_rows = 4 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "DAILY_SEASONALITY", "WEEKLY_SEASONALITY", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == "Hour of day" - assert fig.layout.xaxis3.title["text"] == "Day of week" - assert fig.layout.xaxis4.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "daily" - assert fig.layout.yaxis3.title["text"] == "weekly" - assert fig.layout.yaxis4.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 + # Tests `plot_components(predict_phase=True)` + _ = model.predict(X=hourly_data['test_df']) + fig = model.plot_components(predict_phase=True) + expected_rows = 6 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == "Component Plot - Predicted" + assert fig.layout.title["x"] == 0.5 + + # Tests `plot_components(predict_phase=True)` with gap + _ = model.predict(X=hourly_data['test_df'].iloc[5:, :]) + fig = model.plot_components(predict_phase=True) + expected_rows = 6 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" + assert fig.layout.title["text"] == "Component Plot - Predicted" + assert fig.layout.title["x"] == 0.5 def test_plot_trend_changepoint_detection(df_pt): diff --git a/greykite/tests/sklearn/estimator/test_multistage_forecast_estimator.py 
b/greykite/tests/sklearn/estimator/test_multistage_forecast_estimator.py index 63edd52..09df41e 100644 --- a/greykite/tests/sklearn/estimator/test_multistage_forecast_estimator.py +++ b/greykite/tests/sklearn/estimator/test_multistage_forecast_estimator.py @@ -466,13 +466,13 @@ def test_missing_timestamps_during_aggregation(params, hourly_data_with_reg): drop_incomplete=True, index=0 ) - log_capture.check( + log_capture.check_present( (cst.LOGGER_NAME, "WARNING", "There are missing timestamps in `df` when performing aggregation with " - "frequency D. These points are ts y\nts " - "\n2018-01-03 23 23. " - "This may cause the aggregated values to be biased.") + "frequency D. These points are y\n" + "ts \n" + "2018-01-03 23. This may cause the aggregated values to be biased.") ) @@ -496,7 +496,7 @@ def test_short_fit_length(params): model = MultistageForecastEstimator(**params) with LogCapture(cst.LOGGER_NAME) as log_capture: model._initialize() - log_capture.check( + log_capture.check_present( (cst.LOGGER_NAME, "INFO", "Some `fit_length` is None or is shorter than `train_length`. 
" diff --git a/greykite/tests/sklearn/estimator/test_silverkite_diagnostics.py b/greykite/tests/sklearn/estimator/test_silverkite_diagnostics.py deleted file mode 100644 index 586d9c0..0000000 --- a/greykite/tests/sklearn/estimator/test_silverkite_diagnostics.py +++ /dev/null @@ -1,252 +0,0 @@ -import datetime - -import numpy as np -import pandas as pd -import pytest -from pandas.testing import assert_frame_equal - -from greykite.common import constants as cst -from greykite.common.features.timeseries_features import build_time_features_df -from greykite.sklearn.estimator.silverkite_diagnostics import SilverkiteDiagnostics - - -@pytest.fixture(scope="module") -def test_params(): - time_col = "ts" - # value_col name is chosen such that it contains keywords "ct" and "sin" - # so that we can test patterns specified for each component work correctly - value_col = "basin_impact" - return dict( - time_col=time_col, - value_col=value_col, - df=pd.DataFrame({ - time_col: [ - datetime.datetime(2018, 1, 1), - datetime.datetime(2018, 1, 2), - datetime.datetime(2018, 1, 3), - datetime.datetime(2018, 1, 4), - datetime.datetime(2018, 1, 5)], - value_col: [10, 10, 10, 10, 10], - "dummy_col": [0, 0, 0, 0, 0], - }), - feature_df=pd.DataFrame({ - # Trend columns: growth, changepoints and interactions (total 5 columns) - "ct1": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - "ct1:tod": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - "ct_sqrt": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - "changepoint0_2018_01_02_00": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - "changepoint1_2018_01_04_00": np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - # Lag columns: autoregression, lagged regressor - f"{value_col}_lag1": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - f"{value_col}_avglag_3_5": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - "regressor1_lag1": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - "regressor_categ_avglag_4_6": np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - # Daily seasonality with interaction (total 4 columns) - "sin1_tow_weekly": 
np.array([2.0, 2.0, 2.0, 2.0, 2.0]), - "cos1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), - "is_weekend[T.True]:sin1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), - "is_weekend[T.True]:cos1_tow_weekly": np.array([2.0, 2.0, 2.0, 2.0, 2.0]), - # Yearly seasonality (total 6 columns) - "sin1_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - "cos1_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - "sin2_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - "cos2_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - "sin3_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - "cos3_ct1_yearly": np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - # Holiday with pre and post effect (1 at the where the date and event match) - # e.g. New Years Day is 1 at 1st January, 0 rest of the days - "Q('events_New Years Day')[T.event]": np.array([1.0, 0.0, 0.0, 0.0, 0.0]), - "Q('events_New Years Day_minus_1')[T.event]": np.array([0.0, 0.0, 0.0, 0.0, 0.0]), - "Q('events_New Years Day_minus_2')[T.event]": np.array([0.0, 0.0, 0.0, 0.0, 0.0]), - "Q('events_New Years Day_plus_1')[T.event]": np.array([0.0, 1.0, 0.0, 0.0, 0.0]), - "Q('events_New Years Day_plus_2')[T.event]": np.array([0.0, 0.0, 1.0, 0.0, 0.0]), - }) - ) - - -def test_get_silverkite_components(test_params): - """Tests get_silverkite_components function""" - time_col = test_params["time_col"] - value_col = test_params["value_col"] - df = test_params["df"] - feature_df = test_params["feature_df"] - - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics() - components = silverkite_diagnostics.get_silverkite_components(df, time_col, value_col, feature_df) - expected_residual = df[value_col].values - feature_df.sum(axis=1).values - expected_df = pd.DataFrame({ - time_col: df[time_col], - value_col: df[value_col], - "trend": 5 * np.array([1.0, 1.0, 1.0, 1.0, 1.0]), - "autoregression": 2 * np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - "lagged_regressor": 2 * np.array([4.0, 4.0, 4.0, 4.0, 4.0]), - "WEEKLY_SEASONALITY": 4 * 
np.array([2.0, 2.0, 2.0, 2.0, 2.0]), - "YEARLY_SEASONALITY": 6 * np.array([3.0, 3.0, 3.0, 3.0, 3.0]), - cst.EVENT_PREFIX: np.array([1.0, 1.0, 1.0, 0.0, 0.0]), - "residual": expected_residual, - "trend_changepoints": np.array([0, 1, 0, 1, 0]), - }) - assert_frame_equal(components, expected_df) - - # Test error messages - with pytest.raises(ValueError, match="feature_df must be non-empty"): - silverkite_diagnostics.get_silverkite_components(df, time_col, value_col, feature_df=pd.DataFrame()) - - with pytest.raises(ValueError, match="df and feature_df must have same number of rows."): - silverkite_diagnostics.get_silverkite_components(df, time_col, value_col, feature_df=pd.DataFrame({"ts": [1, 2, 3]})) - - -def test_plot_silverkite_components(test_params): - """Tests plot_silverkite_components function""" - time_col = test_params["time_col"] - value_col = test_params["value_col"] - df = test_params["df"] - feature_df = test_params["feature_df"] - - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics() - components = silverkite_diagnostics.get_silverkite_components(df, time_col, value_col, feature_df) - - # Check plot_silverkite_components with defaults - fig = silverkite_diagnostics.plot_silverkite_components(components) - assert len(fig.data) == 8 + 2 # 2 changepoints - assert [fig.data[i].name for i in range(len(fig.data))] == list(components.columns)[1: -1] + ["trend change point"] * 2 - - assert fig.layout.height == (len(fig.data) - 2) * 350 # changepoints do not create separate subplots - assert fig.layout.showlegend is True # legend for changepoints - assert fig.layout.title["text"] == "Component plots" - assert fig.layout.title["x"] == 0.5 - - assert fig.layout.xaxis.title["text"] == time_col - assert fig.layout.xaxis2.title["text"] == time_col - assert fig.layout.xaxis3.title["text"] == time_col - assert fig.layout.xaxis4.title["text"] == time_col - assert fig.layout.xaxis5.title["text"] == "Day of week" - assert 
fig.layout.xaxis6.title["text"] == "Time of year" - assert fig.layout.xaxis7.title["text"] == time_col - assert fig.layout.xaxis8.title["text"] == time_col - - assert fig.layout.yaxis.title["text"] == value_col - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == "autoregression" - assert fig.layout.yaxis4.title["text"] == "lagged_regressor" - assert fig.layout.yaxis5.title["text"] == "weekly" - assert fig.layout.yaxis6.title["text"] == "yearly" - assert fig.layout.yaxis7.title["text"] == "events" - assert fig.layout.yaxis8.title["text"] == "residual" - - # Check plot_silverkite_components with provided component list and warnings - with pytest.warns(Warning) as record: - names = ["YEARLY_SEASONALITY", value_col, "DUMMY"] - title = "Component plot without trend and weekly seasonality" - fig = silverkite_diagnostics.plot_silverkite_components(components, names=names, title=title) - - expected_length = 2 - assert len(fig.data) == expected_length - assert [fig.data[i].name for i in range(len(fig.data))] == [value_col, "YEARLY_SEASONALITY"] - - assert fig.layout.height == expected_length*350 - assert fig.layout.showlegend is True - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - - assert fig.layout.xaxis.title["text"] == time_col - assert fig.layout.xaxis2.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == value_col - assert fig.layout.yaxis2.title["text"] == "yearly" - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." 
in record[0].message.args[0] - - # Check plot_silverkite_components with exception - with pytest.raises(ValueError, match="None of the provided components have been specified in the model."): - names = ["DUMMY"] - silverkite_diagnostics.plot_silverkite_components(components, names=names) - - -def test_group_silverkite_seas_components(): - """Tests group_silverkite_seas_components""" - silverkite_diagnostics: SilverkiteDiagnostics = SilverkiteDiagnostics() - time_col = "ts" - # Daily - date_list = pd.date_range(start="2018-01-01", end="2018-01-07", freq="H").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "DAILY_SEASONALITY": time_df["hour"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Hour of day": np.arange(24.0), - "daily": np.arange(24.0), - }) - assert_frame_equal(res, expected_df) - - # Weekly - date_list = pd.date_range(start="2018-01-01", end="2018-01-20", freq="D").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "WEEKLY_SEASONALITY": time_df["tow"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Day of week": np.arange(7.0), - "weekly": np.arange(7.0), - }) - assert_frame_equal(res, expected_df) - - # Monthly - date_list = pd.date_range(start="2018-01-01", end="2018-01-31", freq="D").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "MONTHLY_SEASONALITY": time_df["dom"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Time of month": np.arange(31.0)/31, - "monthly": np.arange(1.0, 32.0), - }) - assert_frame_equal(res, expected_df) - - # Quarterly (92 day quarters) - date_list = pd.date_range(start="2018-07-01", 
end="2018-12-31", freq="D").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "QUARTERLY_SEASONALITY": time_df["toq"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Time of quarter": np.arange(92.0)/92, - "quarterly": np.arange(92.0)/92, - }) - assert_frame_equal(res, expected_df) - - # Quarterly (90 day quarter) - date_list = pd.date_range(start="2018-01-01", end="2018-03-31", freq="D").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "QUARTERLY_SEASONALITY": time_df["toq"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Time of quarter": np.arange(90.0)/90, - "quarterly": np.arange(90.0)/90, - }) - assert_frame_equal(res, expected_df) - - # Yearly (non-leap years) - date_list = pd.date_range(start="2018-01-01", end="2019-12-31", freq="D").tolist() - time_df = build_time_features_df(date_list, conti_year_origin=2018) - df = pd.DataFrame({ - time_col: time_df["datetime"], - "YEARLY_SEASONALITY": time_df["toy"] - }) - res = silverkite_diagnostics.group_silverkite_seas_components(df) - expected_df = pd.DataFrame({ - "Time of year": np.arange(365.0)/365, - "yearly": np.arange(365.0)/365, - }) - assert_frame_equal(res, expected_df) diff --git a/greykite/tests/sklearn/estimator/test_silverkite_estimator.py b/greykite/tests/sklearn/estimator/test_silverkite_estimator.py index fe59976..8a925c2 100644 --- a/greykite/tests/sklearn/estimator/test_silverkite_estimator.py +++ b/greykite/tests/sklearn/estimator/test_silverkite_estimator.py @@ -3,7 +3,7 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal +from pandas.testing import assert_frame_equal from sklearn.linear_model import SGDRegressor from sklearn.metrics import 
mean_squared_error @@ -16,6 +16,7 @@ from greykite.common.testing_utils import generate_df_for_tests from greykite.common.testing_utils import generate_df_with_reg_for_tests from greykite.sklearn.estimator.silverkite_estimator import SilverkiteEstimator +from greykite.sklearn.estimator.testing_utils import params_component_breakdowns from greykite.sklearn.estimator.testing_utils import params_components @@ -158,6 +159,11 @@ def X(): }) +@pytest.fixture +def expected_component_names(): + return params_component_breakdowns()["expected_component_names"] + + def test_setup(params): """Tests __init__ and attributes set during fit""" coverage = 0.95 @@ -377,11 +383,12 @@ def test_uncertainty(daily_data): (actual <= forecast_upper) & (actual >= forecast_lower) ).mean() - assert round(calc_pred_coverage) == 97, "forecast coverage is incorrect" + # 96.98 on M1 Mac, 97.99 on Intel Mac / Linux. + assert round(calc_pred_coverage) in (97, 98), "forecast coverage is incorrect" -def test_plot_components(): - """Tests plot_components. +def test_plot_components(expected_component_names): + """Tests "plot_components". Because component plots are implemented in `base_silverkite_estimator.py,` the bulk of the testing is done there. This file only tests inheritance and compatibility of the trained_model generated by this estimator's fit. 
@@ -410,65 +417,16 @@ def test_plot_components(): # suppresses sklearn warning on `iid` parameter for ridge hyperparameter_grid search model.fit(train_df) - # Test plot_components - with pytest.warns(Warning) as record: - title = "Custom component plot" - fig = model.plot_components(names=["trend", "YEARLY_SEASONALITY", "DUMMY"], title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows + 1 # includes changepoints - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "trend", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." 
in record[0].message.args[0] - - # Test plot_trend - title = "Custom trend plot" - fig = model.plot_trend(title=title) - expected_rows = 2 - assert len(fig.data) == expected_rows + 1 # includes changepoints - assert [fig.data[i].name for i in range(expected_rows)] == [cst.VALUE_COL, "trend"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - + # Test "plot_components" + title = "Custom component plot" + fig = model.plot_components(title=title) + expected_rows = 10 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" assert fig.layout.title["text"] == title assert fig.layout.title["x"] == 0.5 - # Test plot_seasonalities - with pytest.warns(Warning): - # suppresses the warning on seasonalities removed - title = "Custom seasonality plot" - fig = model.plot_seasonalities(title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "WEEKLY_SEASONALITY", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == "Day of week" - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "weekly" - assert fig.layout.yaxis3.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - def test_autoreg(daily_data): """Runs a basic model with uncertainty intervals diff --git a/greykite/tests/sklearn/estimator/test_simple_silverkite_estimator.py b/greykite/tests/sklearn/estimator/test_simple_silverkite_estimator.py index c833778..d5cb62f 100644 --- 
a/greykite/tests/sklearn/estimator/test_simple_silverkite_estimator.py +++ b/greykite/tests/sklearn/estimator/test_simple_silverkite_estimator.py @@ -3,7 +3,7 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_frame_equal +from pandas.testing import assert_frame_equal from sklearn.linear_model import SGDRegressor from sklearn.metrics import mean_squared_error @@ -18,6 +18,7 @@ from greykite.common.testing_utils import generate_df_for_tests from greykite.common.testing_utils import generate_df_with_reg_for_tests from greykite.sklearn.estimator.simple_silverkite_estimator import SimpleSilverkiteEstimator +from greykite.sklearn.estimator.testing_utils import params_component_breakdowns @pytest.fixture @@ -191,6 +192,11 @@ def X(): }) +@pytest.fixture +def expected_component_names(): + return params_component_breakdowns()["expected_component_names"] + + def test_setup(params): """Tests __init__ and attributes set during fit""" coverage = 0.90 @@ -404,7 +410,7 @@ def test_uncertainty(daily_data): (actual <= forecast_upper) & (actual >= forecast_lower) ).mean() - assert round(calc_pred_coverage) == 95, "forecast coverage is incorrect" + assert round(calc_pred_coverage) == 96, "forecast coverage is incorrect" def test_normalize_method(daily_data): @@ -452,7 +458,7 @@ def test_normalize_method(daily_data): (actual <= forecast_upper) & (actual >= forecast_lower) ).mean() - assert round(calc_pred_coverage) == 95, "forecast coverage is incorrect" + assert round(calc_pred_coverage) == 96, "forecast coverage is incorrect" assert model.model_dict["normalize_method"] == "statistical" @@ -477,8 +483,8 @@ def test_summary(daily_data): model.summary() -def test_plot_components(): - """Tests plot_components. +def test_plot_components(expected_component_names): + """Tests "plot_components". Because component plots are implemented in `base_silverkite_estimator.py,` the bulk of the testing is done there. 
This file only tests inheritance and compatibility of the trained_model generated by this estimator's fit. @@ -501,65 +507,16 @@ def test_plot_components(): ) model.fit(train_df) - # Test plot_components - with pytest.warns(Warning) as record: - title = "Custom component plot" - fig = model.plot_components(names=["trend", "YEARLY_SEASONALITY", "DUMMY"], title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "trend", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - assert fig.layout.yaxis3.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - assert f"The following components have not been specified in the model: " \ - f"{{'DUMMY'}}, plotting the rest." 
in record[0].message.args[0] - - # Test plot_trend - title = "Custom trend plot" - fig = model.plot_trend(title=title) - expected_rows = 2 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == [cst.VALUE_COL, "trend"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == cst.TIME_COL - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "trend" - + # Tests "plot_components" + title = "Custom component plot" + fig = model.plot_components(title=title) + expected_rows = 9 + assert len(fig.data) == expected_rows # includes changepoints + assert all([fig.data[i].name in expected_component_names for i in range(expected_rows)]) + assert fig.layout.xaxis.title["text"] == "Date" assert fig.layout.title["text"] == title assert fig.layout.title["x"] == 0.5 - # Test plot_seasonalities - with pytest.warns(Warning): - # suppresses the warning on seasonalities removed - title = "Custom seasonality plot" - fig = model.plot_seasonalities(title=title) - expected_rows = 3 - assert len(fig.data) == expected_rows - assert [fig.data[i].name for i in range(expected_rows)] == \ - [cst.VALUE_COL, "WEEKLY_SEASONALITY", "YEARLY_SEASONALITY"] - - assert fig.layout.xaxis.title["text"] == cst.TIME_COL - assert fig.layout.xaxis2.title["text"] == "Day of week" - assert fig.layout.xaxis3.title["text"] == "Time of year" - - assert fig.layout.yaxis.title["text"] == cst.VALUE_COL - assert fig.layout.yaxis2.title["text"] == "weekly" - assert fig.layout.yaxis3.title["text"] == "yearly" - - assert fig.layout.title["text"] == title - assert fig.layout.title["x"] == 0.5 - def test_past_df(daily_data): """Tests ``past_df`` is passed.""" @@ -673,7 +630,7 @@ def test_auto_config(): assert "ct1" in model.model_dict["x_mat"].columns assert model.model_dict["changepoints_dict"]["method"] == "custom" # Holidays is overridden by auto seasonality. 
- assert len(model.model_dict["daily_event_df_dict"]) == 198 + assert len(model.model_dict["daily_event_df_dict"]) == 203 assert "custom_event" in model.model_dict["daily_event_df_dict"] assert "China_Chinese New Year" in model.model_dict["daily_event_df_dict"] diff --git a/greykite/tests/sklearn/test_cross_validation.py b/greykite/tests/sklearn/test_cross_validation.py index 37636b0..cb70dd4 100644 --- a/greykite/tests/sklearn/test_cross_validation.py +++ b/greykite/tests/sklearn/test_cross_validation.py @@ -9,10 +9,10 @@ def assert_splits_equal(actual, expected): :param actual: np.array of np.arrays :param expected: np.array of np.arrays """ - actual = np.array(list(actual)) - for i in range(expected.shape[0]): - for j in range(expected.shape[1]): - np.testing.assert_array_equal(actual[i][j], expected[i][j]) + actual = list(actual) # A list of tuples of `np.array`: [(np.array:train, np.array:test), ...] + for i in range(len(actual)): + np.testing.assert_array_equal(actual[i][0], expected[i][0]) + np.testing.assert_array_equal(actual[i][1], expected[i][1]) def test_rolling_time_series_split(): @@ -26,12 +26,12 @@ def test_rolling_time_series_split(): assert tscv.periods_between_train_test == 0 X = np.random.rand(20, 2) - expected = np.array([ # offset applied, first two observations are not used in CV + expected = [ # offset applied, first two observations are not used in CV (np.array([2, 3, 4, 5, 6, 7]), np.array([8, 9, 10])), (np.array([5, 6, 7, 8, 9, 10]), np.array([11, 12, 13])), (np.array([8, 9, 10, 11, 12, 13]), np.array([14, 15, 16])), (np.array([11, 12, 13, 14, 15, 16]), np.array([17, 18, 19])) - ]) + ] assert tscv.get_n_splits(X=X) == 4 assert_splits_equal(tscv.split(X=X), expected) @@ -55,24 +55,24 @@ def test_rolling_time_series_split2(): assert tscv.periods_between_train_test == 2 X = np.random.rand(20, 4) - expected = np.array([ # no offset + expected = [ # no offset (np.array([0, 1, 2, 3]), np.array([6, 7])), (np.array([0, 1, 2, 3, 4, 5, 6, 7]), 
np.array([10, 11])), (np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]), np.array([14, 15])), (np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]), np.array([18, 19])) - ]) + ] assert tscv.get_n_splits(X=X) == 4 assert_splits_equal(tscv.split(X=X), expected) X = np.random.rand(25, 4) - expected = np.array([ + expected = [ # offset with expanding window, first training is set larger than min_train_periods to use all data (np.arange(5), np.array([7, 8])), (np.arange(9), np.array([11, 12])), (np.arange(13), np.array([15, 16])), (np.arange(17), np.array([19, 20])), (np.arange(21), np.array([23, 24])) - ]) + ] assert tscv.get_n_splits(X=X) == 5 assert_splits_equal(tscv.split(X=X), expected) @@ -91,12 +91,12 @@ def test_rolling_time_series_split3(): max_splits=max_splits) X = np.random.rand(20, 4) - expected = np.array([ + expected = [ (np.array([0, 1, 2, 3]), np.array([6, 7])), (np.array([0, 1, 2, 3, 4, 5, 6, 7]), np.array([10, 11])), (np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]), np.array([14, 15])), (np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]), np.array([18, 19])) - ]) + ] assert tscv.get_n_splits_without_capping(X=X) == 4 assert tscv.get_n_splits(X=X) == max_splits assert_splits_equal(tscv.split(X=X), expected[-max_splits:]) @@ -130,7 +130,8 @@ def test_rolling_time_series_split3(): max_splits=max_splits) assert tscv.get_n_splits_without_capping(X=X) == 4 assert tscv.get_n_splits(X=X) == max_splits - assert_splits_equal(tscv.split(X=X), expected[[0, 2, 3]]) # picked at random (selection fixed by random seed) + expected_to_assert = [expected[i] for i in [0, 2, 3]] + assert_splits_equal(tscv.split(X=X), expected_to_assert) # picked at random (selection fixed by random seed) # all splits are kept (max_splits == get_n_splits) max_splits = 4 @@ -174,7 +175,8 @@ def test_rolling_time_series_split3(): assert tscv.use_most_recent_splits assert tscv.get_n_splits_without_capping(X=X) == 4 assert tscv.get_n_splits(X=X) == max_splits - 
assert_splits_equal(tscv.split(X=X), expected[[1, 2, 3]]) + expected_to_assert = [expected[i] for i in [1, 2, 3]] + assert_splits_equal(tscv.split(X=X), expected_to_assert) def test_rolling_time_series_split_empty(): @@ -189,9 +191,9 @@ def test_rolling_time_series_split_empty(): with pytest.warns(Warning) as record: X = np.random.rand(200, 4) - expected = np.array([ + expected = [ (np.arange(180), np.arange(start=180, stop=200)), # 90/10 split - ]) + ] assert tscv.get_n_splits(X=X) == 1 assert_splits_equal(tscv.split(X=X), expected) obtained_messages = "--".join([r.message.args[0] for r in record]) @@ -199,9 +201,9 @@ def test_rolling_time_series_split_empty(): with pytest.warns(Warning) as record: X = np.random.rand(150, 4) - expected = np.array([ + expected = [ (np.arange(135), np.arange(start=135, stop=150)), # 90/10 split - ]) + ] assert tscv.get_n_splits(X=X) == 1 assert_splits_equal(tscv.split(X=X), expected) obtained_messages = "--".join([r.message.args[0] for r in record]) diff --git a/greykite/tests/sklearn/transform/test_null_transformer.py b/greykite/tests/sklearn/transform/test_null_transformer.py index 8f4dafe..04d07a7 100644 --- a/greykite/tests/sklearn/transform/test_null_transformer.py +++ b/greykite/tests/sklearn/transform/test_null_transformer.py @@ -4,11 +4,11 @@ import numpy as np import pandas as pd import pytest -from pandas.util.testing import assert_equal from sklearn.exceptions import NotFittedError from testfixtures import LogCapture from greykite.common.constants import LOGGER_NAME +from greykite.common.python_utils import assert_equal from greykite.sklearn.transform.null_transformer import NullTransformer diff --git a/requirements-dev.txt b/requirements-dev.txt index abb32e9..c190d76 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -1,50 +1,48 @@ +appnope==0.1.3 bump2version==0.5.11 -Click==7.0 -coverage==4.5.4 -cvxpy==1.1.12 -dill==0.3.3 -flake8==3.7.8 +Click==8.0.1 +coverage==6.3.2 +cvxpy==1.2.1 +dill==0.3.6 
+flake8==6.0.0 holidays==0.13 -holidays-ext==0.0.7 -ipykernel==4.8.2 -ipython==7.1.1 -ipywidgets==7.2.1 +holidays-ext==0.0.8 +ipykernel==6.4.1 +ipython==7.31.1 +ipywidgets==7.5.1 jupyter==1.0.0 -jupyter-client==6.1.7 -jupyter-console==6.4.0 -jupyter-core==4.7.1 -jupyter-server==0.1.1 +jupyter-client==7.4.7 +jupyter-console==6.4.4 +jupyter-core==5.2.0 +jupyter-server==2.2.1 LunarCalendar==0.0.9 -matplotlib==3.4.1 -nbformat==5.1.3 -notebook==5.4.1 -numpy==1.20.2 +matplotlib==3.5.2 +nbformat==5.5.0 +notebook==6.5.2 +numpy==1.23.2 osqp==0.6.1 -overrides==2.8.0 -pandas==1.1.3 -patsy==0.5.1 -Pillow==8.0.1 +overrides==7.3.1 +pandas==1.5.0 +patsy==0.5.2 +Pillow==9.4.0 pip==20.3.3 plotly==5.4.0 -pmdarima==1.8.0 -prophet==1.0 -pystan==2.19.0.0 -pytest==4.6.5 -pytest-runner==5.1 -pyzmq==22.0.3 -requests==2.22.0 -scipy==1.5.4 -seaborn==0.9.0 -six==1.15.0 -scikit-learn==0.24.1 -Sphinx==3.2.1 +pmdarima==1.8.5 +pytest==7.1.2 +pytest-runner==5.3.1 +pyzmq==25.0.0 +requests==2.28.2 +scipy==1.8.0 +seaborn==0.11.1 +six==1.16.0 +scikit-learn==1.1.2 +Sphinx==5.3.0 sphinx-gallery==0.10.1 -sphinx-rtd-theme==0.4.2 -statsmodels==0.12.2 +sphinx-rtd-theme==1.1.1 +statsmodels==0.13.5 testfixtures==6.14.2 -tornado==5.1.1 -tox==3.14.0 -tqdm==4.52.0 +tornado==6.2 +tqdm==4.64.1 twine==1.14.0 watchdog==0.9.0 wheel==0.33.6 diff --git a/setup.cfg b/setup.cfg index 3de6c19..bf0a6e8 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,5 +1,5 @@ [bumpversion] -current_version = 0.3.0 +current_version = 0.5.0 commit = True tag = True diff --git a/setup.py b/setup.py index cc1892a..6f1b4ee 100644 --- a/setup.py +++ b/setup.py @@ -2,7 +2,6 @@ """The setup script.""" -import sys from setuptools import setup, find_packages @@ -11,17 +10,18 @@ requirements = [ - "cvxpy>=1.1.12", + "cvxpy>=1.2.1", "dill>=0.3.3", "holidays-ext>=0.0.7", + "ipython>=7.31.1", "matplotlib>=3.4.1", - "numpy>=1.19.2", - "osqp==0.6.1", # osqp>=0.6.2 uses qdldl which could cause install failure. 
+ "numpy>=1.22.0", # support for Python 3.10 + "osqp>=0.6.1", "overrides>=2.8.0", - "pandas>=1.1.3, <1.3", # pandas 1.3 changes behavior of bfill, ffill + "pandas>=1.5.0, <2.0.0", "patsy>=0.5.1", "plotly>=4.12.0", - "pmdarima>=1.8.0", + "pmdarima>=1.8.0, <=1.8.5", "pytest>=4.6.5", "pytest-runner>=5.1", "scipy>=1.5.4", @@ -36,14 +36,13 @@ test_requirements = ["pytest>=3", ] setup( - python_requires=">=3.7", + python_requires=">=3.10", classifiers=[ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Natural Language :: English", - "Programming Language :: Python :: 3.7", - "Programming Language :: Python :: 3.8" + "Programming Language :: Python :: 3.10" ], description="A python package for flexible forecasting", long_description=LONG_DESC, @@ -64,6 +63,6 @@ test_suite="tests", tests_require=test_requirements, url="https://github.com/linkedin/greykite", - version="0.4.0", + version="0.5.0", zip_safe=False, )
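
Note on the `test_cross_validation.py` hunks above: the expected splits change from an `np.array([...])` wrapper to a plain list of `(train, test)` tuples, and `assert_splits_equal` compares each pair directly. The diff does not state the reason. A plausible one is that the train and test index arrays have different lengths, so the nested sequence is ragged, and NumPy 1.24 and later refuse to build an array from it unless `dtype=object` is given. The sketch below illustrates the pairwise comparison under that assumption. `toy_splits` is a hypothetical stand-in for a splitter's `split` output, and the length check is an addition that is not in the diff.

```python
import numpy as np


def assert_splits_equal(actual, expected):
    """Compares (train, test) index pairs one split at a time."""
    actual = list(actual)  # materializes the generator of (train, test) tuples
    # Added for illustration: the version in the diff does not check the count.
    assert len(actual) == len(expected)
    for (act_train, act_test), (exp_train, exp_test) in zip(actual, expected):
        np.testing.assert_array_equal(act_train, exp_train)
        np.testing.assert_array_equal(act_test, exp_test)


def toy_splits():
    """Hypothetical stand-in for a CV splitter's `split` output."""
    yield np.arange(4), np.array([6, 7])
    yield np.arange(8), np.array([10, 11])


# A plain list avoids building a ragged array from unequal-length index arrays.
expected = [
    (np.array([0, 1, 2, 3]), np.array([6, 7])),
    (np.array([0, 1, 2, 3, 4, 5, 6, 7]), np.array([10, 11])),
]
assert_splits_equal(toy_splits(), expected)

# Lists do not support fancy indexing such as `expected[[1]]`,
# so a subset is selected with a comprehension, as in the diff.
subset = [expected[i] for i in [1]]
```

With a list, slicing such as `expected[-max_splits:]` still works unchanged, which is why only the fancy-indexed calls in the diff needed the `expected_to_assert` comprehension.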