
Added loaders parameter to run_algorithm() in run_algo.py #2199

Closed
wants to merge 14 commits

Conversation

calmitchell617

Added an optional parameter to run_algorithm() that takes a dictionary of PipelineLoader objects. This allows you to load columns of data that aren't served by the default USEquityPricingLoader while running a backtest.

This is my first pull request on any meaningful project, constructive criticism is very welcome.

Contributor

@ssanderson left a comment

@calmitchell617 thanks for the PR! This looks great, especially for a first PR! I think there's a bug currently in the case where loaders isn't passed, and I have some ideas about how we might improve the ergonomics of the API.

It looks like the API proposed here is that the user would pass a dictionary mapping BoundColumn to PipelineLoader? That makes sense for the case where the user is supplying DataFrameLoaders (which can only handle one column at a time), but another common case is to have a loader for each of your datasets, which would be a bit awkward to specify in this API.

A user would have to do something like:

from typing import List

def make_column_to_loader_map(my_datasets: List[Dataset],
                              my_loaders: List[Loader]):
    col_to_loader = {}

    for dataset, loader in zip(my_datasets, my_loaders):
        # Map every column of each dataset to that dataset's loader.
        col_to_loader.update(dict.fromkeys(dataset.columns, loader))

    return col_to_loader
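As a runnable illustration of that dataset-to-column expansion, here is a self-contained sketch; FakeColumn, FakeDataset, and FakeLoader are stand-ins for real zipline types, not an existing API:

```python
from typing import Dict, List

# Hypothetical stand-ins so the mapping helper can run outside zipline;
# real code would use DataSet subclasses and PipelineLoader instances.
class FakeColumn:
    def __init__(self, name):
        self.name = name

class FakeDataset:
    def __init__(self, columns):
        self.columns = columns

class FakeLoader:
    pass

def make_column_to_loader_map(my_datasets: List[FakeDataset],
                              my_loaders: List[FakeLoader]) -> Dict[FakeColumn, FakeLoader]:
    col_to_loader = {}
    for dataset, loader in zip(my_datasets, my_loaders):
        # Every column of a dataset maps to that dataset's loader.
        col_to_loader.update(dict.fromkeys(dataset.columns, loader))
    return col_to_loader

close, volume = FakeColumn("close"), FakeColumn("volume")
pricing = FakeDataset([close, volume])
loader = FakeLoader()

mapping = make_column_to_loader_map([pricing], [loader])
assert mapping[close] is loader and mapping[volume] is loader
```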

A few ideas on how we might improve this interface:

  • We could allow passing a dictionary that either maps columns or datasets to loaders, in which case our internal function might turn into something like:
us_price_loader = USEquityPricingLoader(
    bundle_data.equity_daily_bar_reader,
    bundle_data.adjustment_reader,
)

if user_supplied_loaders is None:
    user_supplied_loaders = {USEquityPricing: us_price_loader}
else:
    # Allow the user to provide their own USEquityPricing loader if they pass one explicitly.
    user_supplied_loaders.setdefault(USEquityPricing, us_price_loader)

def choose_loader(column):
    if column in user_supplied_loaders:
        return user_supplied_loaders[column]
    elif column.dataset in user_supplied_loaders:
        return user_supplied_loaders[column.dataset]
    else:
        raise ValueError(...)
  • We could keep this interface the same, but provide a helper function to convert a map from dataset -> loader into a map from column -> loader.
  • We could keep the current interface that requires a function from column -> loader, but provide a helper class for implementing common patterns, e.g.:
class PipelineDispatcher(object):
    """Helper class for building a dispatching function for a PipelineLoader.

    Parameters
    ----------
    column_loaders : dict[BoundColumn -> PipelineLoader]
        Map from columns to pipeline loader for those columns.
    dataset_loaders : dict[DataSet -> PipelineLoader]
        Map from datasets to pipeline loader for those datasets.
    """
    def __init__(self, column_loaders, dataset_loaders):
        self._column_loaders = column_loaders
        self._dataset_loaders = dataset_loaders

    def __call__(self, column):
        if column in self._column_loaders:
            return self._column_loaders[column]
        elif column.dataset in self._dataset_loaders:
            return self._dataset_loaders[column.dataset]
        else:
            raise LookupError("No pipeline loader registered for %s" % column)
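A quick self-contained demo of how that dispatcher would behave; the class is repeated here verbatim, with stand-in column/dataset/loader objects in place of real zipline types:

```python
class PipelineDispatcher(object):
    """Dispatch from a column to the loader registered for it or its dataset."""
    def __init__(self, column_loaders, dataset_loaders):
        self._column_loaders = column_loaders
        self._dataset_loaders = dataset_loaders

    def __call__(self, column):
        if column in self._column_loaders:
            return self._column_loaders[column]
        elif column.dataset in self._dataset_loaders:
            return self._dataset_loaders[column.dataset]
        else:
            raise LookupError("No pipeline loader registered for %s" % column)

# Stand-ins (hypothetical) for BoundColumn / DataSet / PipelineLoader.
class Dataset(object):
    pass

class Column(object):
    def __init__(self, dataset):
        self.dataset = dataset

pricing, other = Dataset(), Dataset()
close, special = Column(pricing), Column(other)
pricing_loader, df_loader = object(), object()

dispatch = PipelineDispatcher(
    column_loaders={special: df_loader},        # per-column override
    dataset_loaders={pricing: pricing_loader},  # dataset-wide default
)
assert dispatch(special) is df_loader
assert dispatch(close) is pricing_loader
```

Column-level registrations win over dataset-level ones, which is what lets a DataFrameLoader for a single column coexist with a dataset-wide loader.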

I think that last model would probably be the easiest one to extend to support registering datasets/loaders from a zipline extension (e.g., we could keep a global default instance of PipelineDispatcher) and provide hooks to register custom loaders/datasets, which internally would update that global default instance.
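One way the global-default idea could look, sketched with module-level dicts; the hook names (register_column_loader, register_dataset_loader, default_choose_loader) are hypothetical, not an existing zipline API:

```python
# Module-level default mappings that a zipline extension could populate
# at import time. All names here are assumptions for illustration.
_default_column_loaders = {}
_default_dataset_loaders = {}

def register_column_loader(column, loader):
    _default_column_loaders[column] = loader

def register_dataset_loader(dataset, loader):
    _default_dataset_loaders[dataset] = loader

def default_choose_loader(column):
    # Column-level registrations take precedence over dataset-level ones.
    if column in _default_column_loaders:
        return _default_column_loaders[column]
    if column.dataset in _default_dataset_loaders:
        return _default_dataset_loaders[column.dataset]
    raise LookupError("No pipeline loader registered for %s" % column)

# Stand-in column/dataset objects for a quick check.
class _DataSet(object):
    pass

class _Column(object):
    def __init__(self, dataset):
        self.dataset = dataset

ds = _DataSet()
col = _Column(ds)
ldr = object()
register_dataset_loader(ds, ldr)
assert default_choose_loader(col) is ldr
```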

cc @richafrank @llllllllll for general API design thoughts here, and cc @jnazaren since this is relevant to your work on improving the extensibility of Zipline.

            raise ValueError(
                "No PipelineLoader registered for column %s." % column
            )
        elif column in loaders:
Contributor

I think this will crash in the current implementation if the user doesn't supply any custom loaders (because loaders will be None, so column in loaders will barf).

@calmitchell617
Author

@ssanderson, I changed some language to show that this argument is used for loading dataframes, and changed the choose_loader() function so it doesn't freak out when loaders is None:

def choose_loader(column):
    if column in USEquityPricing.columns:
        return pipeline_loader
    elif data_frame_loaders and column in data_frame_loaders:
        return data_frame_loaders[column]
    else:
        raise ValueError(
            "No PipelineLoader registered for column %s." % column
        )

I will try to implement your suggestions.

Just playing devil's advocate: can you think of any cases where it would be better to have separate, optional arguments for different types of loaders in the run_algorithm() function (e.g., one argument for DataFrame loaders and another for DataSet loaders)?

@@ -71,7 +71,8 @@ def _run(handle_data,
          print_algo,
          metrics_set,
          local_namespace,
-         environ):
+         environ,
+         data_frame_loaders):
Contributor

I don't think there is anything special about this being a dataframe loader, right?

@@ -344,6 +349,9 @@ def run_algorithm(start,
     environ : mapping[str -> str], optional
         The os environment to use. Many extensions use this to get parameters.
         This defaults to ``os.environ``.
+    loaders : iterable{PipelineLoader}, optional
Contributor

The docstring and parameter name don't match here.

@jnazaren
Contributor

I've created PR #2246 with one of the solutions suggested above.

5 participants