
Saving the trace #248

Open
alexlyttle opened this issue May 18, 2020 · 11 comments

@alexlyttle

I love the progress with PyMC4 so far. Are there plans to implement a save_trace function in PyMC4?

I have been trying the following:

import pymc4 as pm

...

model = example_model()
trace = pm.sample(model, burn_in=500, num_samples=1000, num_chains=10, xla=True,
                  adaptation_kwargs=dict(target_accept_prob=0.95))

filepath = 'trace.nc'
trace.to_netcdf(filepath)

and it gives this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1088         dump_to_store(
-> 1089             dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1090         )

~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1134 
-> 1135     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1136 

~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    295         self.set_attributes(attributes)
--> 296         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    297         self.set_variables(

~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/common.py in set_dimensions(self, variables, unlimited_dims)
    372                 is_unlimited = dim in unlimited_dims
--> 373                 self.set_dimension(dim, length, is_unlimited)
    374 

~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/netCDF4_.py in set_dimension(self, name, length, is_unlimited)
    420         dim_length = length if not is_unlimited else None
--> 421         self.ds.createDimension(name, size=dim_length)
    422 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createDimension()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dimension.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Name contains illegal characters

The illegal characters seem to be the '/' in the variable names. I have a lot of variables, but could manually replace the characters if need be. However, being able to save them as they are would be better, especially for reloading the trace.

Do you have any suggestions on saving the trace to allow such characters? If the '/' character makes saving problematic, are there any alternatives that can be specified when PyMC4 creates the variable names?

@alexlyttle
Author

Update: This seems to work for saving the posterior, using xarray's Dataset.to_netcdf with the 'scipy' engine:

trace.posterior.to_netcdf('posterior.h5', engine='scipy')

reload with

from xarray import open_dataset

posterior = open_dataset('posterior.h5', engine='scipy')

I think I could do this for all the groups in trace; it's just that trace.to_netcdf and az.to_netcdf(trace) don't work for me, even when I manually replace the '/' in the variable names with '.'.

Has anyone got any better ideas for saving/loading the trace object? I'm fairly new to xarray and ArviZ. I think it would be useful to have a save_trace function in PyMC4 in the future, and I'd be happy to contribute at some point 😁 I'm excited at the prospect of using PyMC4 in my work, so I'm happy to help with its progress!
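
Roughly what I have in mind, for reference (just a sketch, untested; it assumes trace is the InferenceData returned by pm.sample and that InferenceData.groups() lists the stored groups, and the helper names are made up):

import arviz as az
from xarray import open_dataset

# Write each group to its own file with the 'scipy' engine, since that engine
# writes netCDF3 and cannot store several groups in one file.
def save_trace(trace, prefix='trace'):
    for group in trace.groups():
        getattr(trace, group).to_netcdf(f'{prefix}_{group}.h5', engine='scipy')

# Rebuild the InferenceData from the per-group files.
def load_trace(prefix='trace', groups=('posterior',)):
    datasets = {g: open_dataset(f'{prefix}_{g}.h5', engine='scipy') for g in groups}
    return az.InferenceData(**datasets)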

@ericmjl
Member

ericmjl commented May 20, 2020

Hi @alexlyttle! Thanks for chiming in. This is an interface design decision between the ArviZ devs and the PyMC4 devs - the namespacing in pm4 is probably why you're seeing this error.

Tagging @canyon289 @ColCarroll @AlexAndorra and @aloctavodia here, who I think are the more active ones on ArviZ amongst us. They probably know better what kind of API changes are needed - perhaps pushing the engine kwarg up to ArviZ's top-level API, or collecting kwargs?

@AlexAndorra
Contributor

Hi @alexlyttle, and thanks for the ping @ericmjl !
I don't think this is on our radar, as PyMC4 is not production ready yet. That being said, the fact that you can do it with xarray's to_netcdf function is probably a hint that this feature could be quite easy to implement.
Pinging @OriolAbril and @ahartikainen as they know more than me about this.

@OriolAbril
Member

OriolAbril commented May 23, 2020

We have to look carefully into this. IIRC, the / is used as a group identifier, and on the ArviZ side we rely heavily on groups. That being said, subgroups are allowed, so this should not break anything if done properly.

I am not following pymc4 closely enough, so the following proposal may make no sense. If the variable names are model_name/var_name, maybe it is possible to drop the leading model_name and store it as an attribute of the InferenceData object? There are several places where we use model names (which currently have to be passed as dict keys to az.compare, az.plot_elpd...), and we are considering adding a name attribute to InferenceData objects.
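
Something along these lines, as a rough sketch of the idea (it assumes a single shared prefix and that trace.posterior is a plain xarray Dataset; the attribute name is just an example):

# Drop the leading "model_name/" from every variable and keep it as a dataset
# attribute instead, so the stored names contain no slashes.
posterior = trace.posterior
model_name = next(iter(posterior.data_vars)).split("/")[0]
posterior = posterior.rename({v: v.split("/", 1)[1] for v in posterior.data_vars})
posterior.attrs["model_name"] = model_name
posterior.to_netcdf("posterior.nc")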

@alexlyttle
Author

Thanks @ericmjl for tagging the relevant people!

Hi @AlexAndorra, thanks for the response. I understand this isn't a priority atm but thought it worth mentioning. With regards to PyMC4 not being production ready: this may be a separate issue for discussion elsewhere, but we have been using PyMC3 in our research for a while and our particular model takes ~ hours to run. I recently converted it to PyMC4 and it takes ~ 10 mins thanks to GPU XLA! This is a game-changer for us and we look forward to the full release! We are making sure to keep track of the PyMC4 version, being fully aware it is pre-release, but wondered whether it would be fine to use in published work (despite the disclaimer in the README). My opinion is that if we run the final model in PyMC3 and the results differ negligibly, it's fine. If this needs more discussion I can get in touch another way or in a different issue.

Thanks @OriolAbril, this makes a lot of sense to me. I think in PyMC4 you can have model_name/sub_model_name/var_name, as in this example, but the devs can correct me if I'm wrong! Maybe something that just encodes/decodes the / when saving/loading would work. The issue with my idea of saving with the 'scipy' engine is that the / are replaced with . but remain . upon reload.

@lucianopaz
Contributor

@alexlyttle, we're thrilled that your model achieves such a big speedup with pymc4 (and the GPU). Could I ask you a few questions? How big is your model? By this I mean: how many variables does it have to sample from, and how many observed data points are there?

About the naming convention used by pymc4, you are correct. We follow TensorFlow's policy of name scopes, which can be nested. Each level of the name scope is separated by forward slashes, and the rightmost part is the plain name of the variable.

We could do a quick fix by implementing a custom save_trace and load_trace that changes the forward slashes to some combination of characters and then reverses that when loading. It's not ideal really but it should be easy to do.

A more complicated but cleaner solution would be to take advantage of HDF5 and netCDF groups. The root level of each name scope can be considered a group, because it's a model that contains a group of random variables and potentially nested models (nested groups). I consider this to be cleaner, but it could potentially impact the ArviZ groups design, so I don't know what's the best way to proceed.
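
The quick fix could look roughly like this (only a sketch: save_trace/load_trace are hypothetical names, the replacement string is arbitrary, and it modifies the InferenceData groups in place for simplicity):

import arviz as az

SEP = "__"  # any stand-in for "/" that netCDF accepts and that never appears in real names

def save_trace(trace, path):
    # Rename the variables in every group so the names contain no forward slashes.
    # Dimension/coordinate names containing slashes would need the same treatment.
    for group in trace.groups():
        ds = getattr(trace, group)
        setattr(trace, group, ds.rename({v: v.replace("/", SEP) for v in ds.data_vars}))
    trace.to_netcdf(path)

def load_trace(path):
    # Reverse the renaming after reading the file back.
    trace = az.from_netcdf(path)
    for group in trace.groups():
        ds = getattr(trace, group)
        setattr(trace, group, ds.rename({v: v.replace(SEP, "/") for v in ds.data_vars}))
    return trace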

@alexlyttle
Author

@lucianopaz thanks for your interest! It's possible the main contribution to the speed-up is that we are using a regression neural network (fully connected, 6 x 128 hidden neurons) trained with TensorFlow, which maps inputs to outputs (some of which are observables). This is converted to Theano in order to sample in our PyMC3 model (which cannot utilise the GPU efficiently). In PyMC4 we are able to sample through the trained neural net entirely on the GPU. It's a hierarchical model with 5 independent hyper-parameters and 5 object-level parameters which feed into the neural net. There are 4 observables derived from the neural net output for ~100 data points. I hope this makes some sense!

I like the sound of the cleaner solution. I may have a go at doing something like that in my code and feed back here, as I'd be happy to help in some way towards this, but I'm no expert on HDF5 or netCDF.

@OriolAbril
Member

I am not sure saving the variables in a nested group structure would be cleaner in practice. If the implementation and specification of the data structure were a little different, the situation would be different.

When saving, we'd have to use the string after the last / as the variable name and the rest as the group, an operation that has to be performed on each variable. This is not difficult to do, but it is quite inconvenient: we currently call save once on an xarray dataset, which stores all variables at once (in the same group).

When loading, we'd have to recursively inspect the groups manually in order to get all variables, load them, and once loaded merge them all together into a single xarray dataset. Currently the whole posterior dataset is loaded at once, directly as an xarray dataset.

Moreover, the nested group structure would not be part of the loaded dataset: users could not select variables based on the values before the / using xarray. It can be done in ArviZ (starting from the latest release) with regular expression matching in the var_names argument, which is independent of the /; using . or : would work in the same way.

And finally, groups/datasets are independent of one another, while coordinate values are shared within a group/dataset object (that is, the chain, draw, and extra dimension labels are stored once per group/dataset, not once per variable). Thus, using nested groups we would be storing the exact same coordinate labels multiple times.
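
For reference, the regular-expression selection mentioned above looks like this (assuming an ArviZ release that has the filter_vars argument; the prefix is just an example):

import arviz as az

# Select every variable whose name starts with "model_name", regardless of
# whether the separator after it is "/", "." or ":".
az.summary(trace, var_names=["^model_name"], filter_vars="regex")
az.plot_trace(trace, var_names=["^model_name"], filter_vars="regex")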

@lucianopaz
Contributor

Thanks for the feedback @alexlyttle. About publishing, you have to be aware that at the moment pymc4 has no initial tuning for the NUTS sampler. This means that the mass matrix (the covariance matrix of the momentum proposals made at the beginning of each HMC step) will not be automatically tuned to the model you are using. This can lead to poor exploration of the state space, a suboptimal acceptance rate, and potentially more divergences during sampling. It does not mean that you cannot use the traces you get from pymc4, but it does mean that you have to diagnose your results carefully. You will have to look at the effective sample size (which depends on the autocorrelation of the samples in the trace), the Rhat (which depends on how well mixed each chain is and whether every chain reaches the same steady state), and the warmup period in which the samples go from the initial proposal to the steady-state typical set; finally, you should also look at the energy diagnostics (e.g. the energy plot) to check whether your MCMC explored the full posterior parameter space or whether you are getting biased answers.
Actually, you should always try to look at these quantities, but without a good estimate of the mass matrix it's almost certain that your trace will show some problematic behavior. At the moment, the only thing you can do about that is to change the step_size by hand to improve your trace, or to draw more samples from your chain and then thin it by slicing (as one usually does with Metropolis-Hastings).
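
For concreteness, a minimal pass over those diagnostics with ArviZ would be something like this (standard ArviZ functions; trace is the InferenceData returned by pm.sample):

import arviz as az

print(az.ess(trace))    # effective sample size; low values mean high autocorrelation
print(az.rhat(trace))   # should be close to 1 when the chains mix and agree with each other
az.summary(trace)       # ess, r_hat and mcse for every variable in a single table
az.plot_energy(trace)   # overlap of the marginal and transition energy distributions
az.plot_trace(trace)    # visual check of mixing and of the warmup behaviour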

@lucianopaz
Contributor

@OriolAbril, OK, so we will have to write custom save_trace and load_trace functions in pymc4 that change the / character under the hood to something that doesn't break netCDF.

@alexlyttle
Author

Thanks for the advice @lucianopaz! Actually, I have been noticing divergences and some other issues with the chains. With regards to changing the step_size, I tried implementing something similar to the "radon_hierarchical" notebook, but I will need to take some time to fully understand this. We'll make sure to check these things (effective sample size, Rhat, etc.) before publishing, and will move back to PyMC3 if we struggle with manual tuning. I hope to write a blog post soon with our PyMC4 model to show off its potential for future research. When I do, I'll link it in this thread in case you're interested.
