
Added loaders parameter to run_algorithm() in run_algo.py #2199

Closed
wants to merge 14 commits

Conversation

calmitchell617

Added an optional parameter to run_algorithm() that takes a dictionary of PipelineLoader objects. This allows you to load columns of data that aren't served by the default USEquityPricingLoader while running a backtest.

This is my first pull request on any meaningful project, constructive criticism is very welcome.

Contributor

@ssanderson left a comment

@calmitchell617 thanks for the PR! This looks great, especially for a first PR! I think there's a bug currently in the case where loaders isn't passed, and I have some ideas about how we might improve the ergonomics of the API.

It looks like the API proposed here is that the user would pass a dictionary mapping BoundColumn to PipelineLoader? That makes sense for the case where the user is supplying DataFrameLoaders (which can only handle one column at a time), but another common case is to have a loader for each of your datasets, which would be a bit awkward to specify in this API.

A user would have to do something like:

from typing import List

def make_column_to_loader_map(my_datasets: List[Dataset],
                              my_loaders: List[Loader]):
    col_to_loader = {}

    for dataset, loader in zip(my_datasets, my_loaders):
        # Map every column of each dataset to that dataset's loader.
        col_to_loader.update(dict.fromkeys(dataset.columns, loader))

    return col_to_loader
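As a runnable illustration of that dataset-to-column expansion, here is a self-contained sketch; FakeColumn, FakeDataset, and FakeLoader are stand-ins for real zipline types, not an existing API:

```python
from typing import Dict, List

# Hypothetical stand-ins so the mapping helper can run outside zipline;
# real code would use DataSet subclasses and PipelineLoader instances.
class FakeColumn:
    def __init__(self, name):
        self.name = name

class FakeDataset:
    def __init__(self, columns):
        self.columns = columns

class FakeLoader:
    pass

def make_column_to_loader_map(my_datasets: List[FakeDataset],
                              my_loaders: List[FakeLoader]) -> Dict[FakeColumn, FakeLoader]:
    col_to_loader = {}
    for dataset, loader in zip(my_datasets, my_loaders):
        # Every column of a dataset maps to that dataset's loader.
        col_to_loader.update(dict.fromkeys(dataset.columns, loader))
    return col_to_loader

close, volume = FakeColumn("close"), FakeColumn("volume")
pricing = FakeDataset([close, volume])
loader = FakeLoader()

mapping = make_column_to_loader_map([pricing], [loader])
assert mapping[close] is loader and mapping[volume] is loader
```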

A few ideas on how we might improve this interface:

  • We could allow passing a dictionary that either maps columns or datasets to loaders, in which case our internal function might turn into something like:
us_price_loader = USEquityPricingLoader(
    bundle_data.equity_daily_bar_reader,
    bundle_data.adjustment_reader,
)

if user_supplied_loaders is None:
    user_supplied_loaders = {USEquityPricing: us_price_loader}
else:
    # Allow the user to provide their own USEquityPricing loader if they pass one explicitly.
    user_supplied_loaders.setdefault(USEquityPricing, us_price_loader)

def choose_loader(column):
    if column in user_supplied_loaders:
        return user_supplied_loaders[column]
    elif column.dataset in user_supplied_loaders:
        return user_supplied_loaders[column.dataset]
    else:
        raise ValueError(...)
  • We could keep this interface the same, but provide a helper function to convert a map from dataset -> loader into a map from column -> loader.
  • We could keep the current interface that requires a function from column -> loader, but provide a helper class for implementing common patterns, e.g.:
class PipelineDispatcher(object):
    """Helper class for building a dispatching function for a PipelineLoader.

    Parameters
    ----------
    column_loaders : dict[BoundColumn -> PipelineLoader]
        Map from columns to pipeline loader for those columns.
    dataset_loaders : dict[DataSet -> PipelineLoader]
        Map from datasets to pipeline loader for those datasets.
    """
    def __init__(self, column_loaders, dataset_loaders):
        self._column_loaders = column_loaders
        self._dataset_loaders = dataset_loaders

    def __call__(self, column):
        if column in self._column_loaders:
            return self._column_loaders[column]
        elif column.dataset in self._dataset_loaders:
            return self._dataset_loaders[column.dataset]
        else:
            raise LookupError("No pipeline loader registered for %s" % column)
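A quick self-contained demo of how that dispatcher would behave; the class is repeated here verbatim, with stand-in column/dataset/loader objects in place of real zipline types:

```python
class PipelineDispatcher(object):
    """Dispatch from a column to the loader registered for it or its dataset."""
    def __init__(self, column_loaders, dataset_loaders):
        self._column_loaders = column_loaders
        self._dataset_loaders = dataset_loaders

    def __call__(self, column):
        if column in self._column_loaders:
            return self._column_loaders[column]
        elif column.dataset in self._dataset_loaders:
            return self._dataset_loaders[column.dataset]
        else:
            raise LookupError("No pipeline loader registered for %s" % column)

# Stand-ins (hypothetical) for BoundColumn / DataSet / PipelineLoader.
class Dataset(object):
    pass

class Column(object):
    def __init__(self, dataset):
        self.dataset = dataset

pricing, other = Dataset(), Dataset()
close, special = Column(pricing), Column(other)
pricing_loader, df_loader = object(), object()

dispatch = PipelineDispatcher(
    column_loaders={special: df_loader},        # per-column override
    dataset_loaders={pricing: pricing_loader},  # dataset-wide default
)
assert dispatch(special) is df_loader
assert dispatch(close) is pricing_loader
```

Column-level registrations win over dataset-level ones, which is what lets a DataFrameLoader for a single column coexist with a dataset-wide loader.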

I think that last model would probably be the easiest one to extend to support registering datasets/loaders from a zipline extension (e.g., we could keep a global default instance of PipelineDispatcher) and provide hooks to register custom loaders/datasets, which internally would update that global default instance.
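One way the global-default idea could look, sketched with module-level dicts; the hook names (register_column_loader, register_dataset_loader, default_choose_loader) are hypothetical, not an existing zipline API:

```python
# Module-level default mappings that a zipline extension could populate
# at import time. All names here are assumptions for illustration.
_default_column_loaders = {}
_default_dataset_loaders = {}

def register_column_loader(column, loader):
    _default_column_loaders[column] = loader

def register_dataset_loader(dataset, loader):
    _default_dataset_loaders[dataset] = loader

def default_choose_loader(column):
    # Column-level registrations take precedence over dataset-level ones.
    if column in _default_column_loaders:
        return _default_column_loaders[column]
    if column.dataset in _default_dataset_loaders:
        return _default_dataset_loaders[column.dataset]
    raise LookupError("No pipeline loader registered for %s" % column)

# Stand-in column/dataset objects for a quick check.
class _DataSet(object):
    pass

class _Column(object):
    def __init__(self, dataset):
        self.dataset = dataset

ds = _DataSet()
col = _Column(ds)
ldr = object()
register_dataset_loader(ds, ldr)
assert default_choose_loader(col) is ldr
```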

cc @richafrank @llllllllll for general API design thoughts here, and cc @jnazaren since this is relevant to your work on improving the extensibility of Zipline.

            raise ValueError(
                "No PipelineLoader registered for column %s." % column
            )
        elif column in loaders:
Contributor

I think this will crash in the current implementation if the user doesn't supply any custom loaders (because loaders will be None, so column in loaders will barf).

@calmitchell617
Author

@ssanderson, I changed some language to show that this argument is used for loading dataframes, and changed the choose_loader() function so it doesn't freak out when loaders is None:

def choose_loader(column):
    if column in USEquityPricing.columns:
        return pipeline_loader
    elif data_frame_loaders and column in data_frame_loaders:
        return data_frame_loaders[column]
    else:
        raise ValueError(
            "No PipelineLoader registered for column %s." % column
        )

I will try to implement your suggestions.

Just playing devil's advocate: can you think of any cases where it would be better to have separate, optional arguments for different types of loaders in the run_algorithm() function (e.g., one argument for DataFrame loaders and another for DataSet loaders)?

@@ -71,7 +71,8 @@ def _run(handle_data,
          print_algo,
          metrics_set,
          local_namespace,
-         environ):
+         environ,
+         data_frame_loaders):
Contributor

I don't think there is anything special about this being a dataframe loader, right?

@@ -344,6 +349,9 @@ def run_algorithm(start,
     environ : mapping[str -> str], optional
         The os environment to use. Many extensions use this to get parameters.
         This defaults to ``os.environ``.
+    loaders : iterable{PipelineLoader}, optional
Contributor

The docstring and parameter name don't match here.

@jnazaren
Contributor

I've created PR #2246 with one of the solutions suggested above.

5 participants