Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent eltype and allow to specify type in rand #1433

Closed
wants to merge 1 commit into from
Closed

Conversation

devmotion
Copy link
Member

Scope of the PR

The PR tries to address some of the problems with rand and eltype - but it does not intend to solve all inconsistencies and problems. I try to emphasize this very strongly here because I really want to avoid that this PR ends up just like all the other eltype discussions - which eventually means it will lead nowhere and nothing will change or be improved. It would be great if we could keep the discussion in this PR focused on the changes in this PR and on the question whether these changes are useful or not.

In particular, this PR is not concerned with whether eltype should be replaced by e.g. Random.gentype and/or return the type of the samples from rand instead of the element type of the samples (as it is documented and implemented currently). Hence I suggest discussing these things in other issues (probably for most things already an issue exists).


Proposal

This PR is a proposal but not complete yet and would benefit from the generalizations of rand in #1391. However, I would like to start a discussion of the suggested changes.

The PR

  • makes the element type of samples produced by rand, defined by eltype(::Type{<:Distribution}), more consistent and defines it as Int or Float64 for most distributions
  • adds rand(::Type{T}, ::Sampleable, ...) methods that allow to specify non-standard element types of samples (similar to e.g. rand(T))

The first change is already the default currently for many distributions and affects only the odd implementations of Normal, MvNormal, MvNormalCanon, MvLogNormal, GenericMvTDist, Dirichlet, and LKJCholesky. Hence it could be viewed as bugfix but I think it is safer to make a breaking release. The second change is non-breaking.

Fixes #1071, #1082, #960, #1041, #1355, #1045.

Fixes #1163 and #821:

julia> rand(Poisson(0.2f0))
0

julia> x = rand(Float64, Poisson(1e20))
1.0000000000521694e20

julia> isinteger(x)
true

Motivation

There are many open and closed issues and PRs about the current eltype situation. One issue is that some distributions such as Normal define eltype based on the parameters whereas most distributions define eltype without taking them into account. In the first case hence one can change the type of the parameters of a distribution to achieve that rand returns samples of a different type whereas in the latter case one can't. Thus the definition of eltype is directly related to the question of how one can obtain samples whose element types are not Float64 or Int, the default types for ContinuousDistribution and DiscreteDistribution.

I argue that often it is not reasonable to define eltype based on the parameters of a distribution and instead it is necessary to be able to specify the type of the samples separately. For instance, usually the parameter of a Poisson distribution is a floating point number but the samples are integer-valued and hence often expected to be of type Int. It does not seem possible to set and adjust the type of the samples based on the parameter type.

Another indication that the current design is a bit ad-hoc is that e.g. in the odd case of Normal which defines eltype as the type of the parameters (i.e., identical to partype which is the interface function for the type of the parameters) eltype may return Int: One can define a Normal distribution with integer-valued parameters which is useful e.g. if one wants to perform calculations, e.g., of logpdf, with points x of different types such as Float32 and Float64 and wants to avoid undesired type promotions. However, clearly the samples from a Normal distribution with integer-valued parameters are not integers. Currently, we use the workaround float(eltype(d)) and return samples of type Float64 in this case. But this is just a heuristic and so while we avoid promotions in e.g. logpdf it does not allow to generate samples of type Float32 if the parameters are integer-valued.

Additionally, the current design is inconsistent as noted in many issues and PRs: Why should the parameters of Normal affect the type of samples from rand but e.g. samples from Uniform are always of type Float64, independent of the parameters?

These observations motivate the proposal in this PR to

  • define eltype for most distribution in a parameter-independent way (default: Float64 and Int for ContinuousDistributions and DiscreteDistributions)
  • allow to specify a different sample type in the rand call directly.

The first point fixes the inconsistent behaviour of Normal, MvNormal, MvLogNormal, and Dirichlet. It allows us to remove the float(eltype(...)) heuristic completely. The second point ensures that it is still possible to adjust the types of the samples. However, now this can be done independent of the parameters, e.g. one does not have to change the parameters to Float32 to obtain Float32 samples and one can generate Float32 samples from integer-valued Normal distributions or Poisson distributions with a parameter of type Float64.

There are some distributions though for which it makes sense and is even required to base the default type of the samples on some parameter:

  • Dirac: it seems reasonable to "sample" the value of the Dirac distribution by returning its parameter without any conversions (the parameter can be any Real and is also not restricted to Int)
  • DiscreteNonParametric: similarly, it makes sense to sample from the provided support without conversion (again the values are also not necessarily of type Int)
  • LocationScale for DiscreteDistributions: Scaling and shifting can promote the type of the support of a DiscreteDistribution e.g. to Rationals, and hence the default type of the samples should take into account the parameters of LocationScale and not be set to the sample type of the unscaled distribution (which could be e.g. of type Int)

Alternatives

The second point (rand(::Type{T}, ::Sampleable)) seems generally useful, less controversial, and non-breaking. Hence I did not consider any alternative there.

The first point is more controversial as it changes e.g. the type of samples from Normal(0f0) and hence, in my opinion, is breaking (if one does not consider it as a bugfix). An alternative would be e.g. to define eltype based on the parameters if the parameters are clearly and directly related to the support: E.g.,

  • for Normal eltype would still be based on the parameters as the mean parameter is directly related to the support,
  • eltype of Uniform would be changed such that it is based on the type of the bounds which define the support,
  • but eltype of Beta would be set to Float64 and of Poisson to Int since its parameters are not directly related to the support

However, this would not fix the float(eltype(...)) and one would still have to deal with non-floating point parameters. An approach could be to include it in the definition of eltype directly instead of applying it as a heuristic in the rand calls: E.g., one could define Base.eltype(::Type{Normal{T}}) where {T} = float(T) - the parameter type could still be retrieved unmodified with partype.

I am not sure if this would be more confusing since the definition of eltype would be less consistent. Also maybe sometimes it could be a bit difficult to decide if eltype should be based on the parameters or not. On the other hand, maybe it would be more intuitive since e.g. there are already two open PRs that want to define Base.eltype(::Type{Uniform{T}}) where {T} = float(T) (in the PRs probably T but again parameters don't have to be floating point numbers). Possibly therefore such a change would be less surprising and could be viewed as a non-breaking bugfix.

@andreasnoack
Copy link
Member

Thanks for PR and the careful writeup. I support these changes.

@cscherrer
Copy link

I can see how this approach to rand(::Type{T}, ::Sampleable) ("T specifies the eltype of the result") makes sense for Sampleable subtypes that generate <:Real or <:AbstractArray{<:Real}. But how would like work if the Sampleable generates a tuple or named tuple, or even a Vector{Vector{Float}}?

I can understand T being the <:AbstractFloat that's passed to the low-level rand, or to have T be the type of the resulting sample. But for general Sampleable subtypes, eltype seems like an awkward middle ground.

@devmotion
Copy link
Member Author

devmotion commented Dec 16, 2021

I think these questions touch more general limitations of Random and of the eltype design - both which, as stated above, this PR intentionally does not want to address or change. This PR is supposed to be a major improvement of the rand API with almost/basically no breaking changes and with a convenient and intuitive user experience. However, clearly and on purpose it does not fix other problems with rand. The main reason is that it is much easier to discuss, to implement, and to release (ideally non-breaking) improvements if they only change one/a few things at a time.

Since Distributions (apart from LKJ) only implements Distributions of ArrayLikeVariates, i.e., variates of type Real and Array{<:Real}, distributions of other types such as e.g. NamedTuples are not a major concern currently (in Distributions). That being said, I don't think this PR limits or worsens the design of rand for such cases since

  • packages that work with distributions of e.g. Tuples or NamedTuples are free to use e.g. an API such as just rand(rng, dist) or possibly rand(rng, ::Type{NamedTuple{(:a, :b),Tuple{Float64,Float32}}, dist) or rand(rng, ::Type{Tuple{Float64,Float32}}, dist) - or whatever API seems reasonable to them,
  • for the most common case of distributions of ArrayLikeVariates it seems very convenient to be able to specify just rand(rng, ::Type{Float32}, dist) instead of e.g. rand(rng, ::Type{Array{Float32,3}}, dist) - so even if eltype is changed or removed, the concise rand syntax seems useful and convenient (which is also consistent with rand(Float32, (3, 2)) etc. which uses Array containers by default).

@cscherrer
Copy link

Thanks @devmotion . As you probably guess, my main concern here is that we have something very similar in MeasureTheory, and many measures correspond to types other than arrays. Differences between Distributions and MeasureTheory aren't such a big deal, but it's convenient for users when there can be commonality.

It had seemed at first you were suggesting an invariant like

eltype(rand(::Type{T}, ::Sampleable)) == T

But from this

  • packages that work with distributions of e.g. Tuples or NamedTuples are free to use e.g. an API such as just rand(rng, dist) or possibly rand(rng, ::Type{NamedTuple{(:a, :b),Tuple{Float64,Float32}}, dist) or rand(rng, ::Type{Tuple{Float64,Float32}}, dist) - or whatever API seems reasonable to them,

it sounds like this could vary case by case, and may not be expected to satisfy any particular invariant. Is that right?

@devmotion
Copy link
Member Author

it sounds like this could vary case by case, and may not be expected to satisfy any particular invariant. Is that right?

I expect that every Distribution{<:ArrayLikeVariate} (i.e., all common distributions in Distributions) have to satisfy this invariant. But since eltype (which I don't want to touch here to keep changes minimal) is focused on these types of distributions but doesn't make sense for arbitrary distributions, in my opinion the property only makes sense for these distributions where eltype is a more or less reasonable thing (which, however, are all commonly used distributions).

@bgctw
Copy link
Contributor

bgctw commented Sep 8, 2023

Coming from #1765 I am new to this long-standing discussion and my thinking is focused on continuous uniform distributions.

Because the type of parameters and the type of the random numbers are two different things, we should keep them separate concepts. This, however, is not straightforward e.g. with AffineDistribution that transforms both parameter-derived quantities (like mode, ...) and samples.
(For some quantities though, e.g. the minimum or quantile, its not immediately clear if its parameter-like or sample-like)

The specification of type of samples with the rand functions as proposed by @devmotion has its limitations where the user is not the direct caller of the rand function. E.g. with Turing, I specify x ~ Normal(), but the call to rand is done somewhere inside Turing, where I cannot control to get Float32 or Float64. Hence, I would prefer a solution where the eltype of the sample can be specified with the creation of the distribution object.

How about introducing a second parametric type for the element-type S of the sample, in addition to the parametric type T for the distribution parameters.

Then we could have

Base.partype(::Type{Uniform{T,S}}) where {T,S} = T
Base.eltype(::Type{Uniform{T,S}}) where {T,S} = S

and letting S default to Float64 or Int64 when not explicitly specified with the construction of the Distribution object, or for distributions where this parametric type is not yet implemented.

The AffineDistribution can also receive two parametric types. Given its construction with parameters mu and sigma of type P and source distribution rho then I suggest:

S = promote_type(P, eltype(rho))
T = promote_type(P, partype(rho))

If S is defined as element-type instead of the type of the entire sample, does it still confer the problems of sampling structured samples raised by @cscherrer?

@quildtide
Copy link
Contributor

I agree with the suggestions from @bgctw.

@quildtide
Copy link
Contributor

Well, on second thought, the counterargument is that it might be better to go with this interface, and then have Turing.jl allow users to specify sampling types that it then sends to rand.

Moving eltype into constructors themselves is more likely to propagate in weird ways to existing code, while the solution in this pull req is less likely to affect existing 3rd party code until explicit support is added (both a good thing and a bad thing, but probably more of a good thing overall since updating the big rand(d::Distribution) users like Turing.jl probably shouldn't be too hard)

@devmotion
Copy link
Member Author

I don't think this is the right approach. At least for continuous distributions, IMO one should not encode the type of the samples in the distribution type - basically, similar to how typically in Julia functions are not typed with their return type. Trying to reason about return/sample types is brittle and prone to unexpected bugs, and it limits the dynamic nature of Julia. Similar to regular functions, IMO one should just push forward the result of the "basic" non-Distributions rand call which would then automatically be promoted (if necessary) based on e.g. the distribution parameters. That is, one would something along the lines of rand(T, dist) = quantile(d, rand(T)).

That being said, for discrete distributions it is a slightly different story - since the return type of quantile(d, p) is typically disconnected from the type of p. But one the other hand, maybe that's fine and natural as well - often a fixed return type of quantile might be OK, sometimes there might be an obvious choice (e.g., the element type of d.support for d::DiscreteNonParametric), and as a last resort if necessary one could make some discrete distributions a bit more flexible by encoding the support type/return type of quantile in the distribution type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poisson rand in Float32? rand(dist) does not consistently honor the element type of dist
5 participants