Making Kriging differentiable and better initial hyperparams #368

Closed
archermarx opened this issue Jul 7, 2022 · 4 comments

archermarx commented Jul 7, 2022

In order to make hyperparameter optimization of Gaussian process models simpler (cf. #328), it would be nice to be able to use AD on Kriging surrogates. This turns out not to be too difficult: a single change to the computation of the Kriging coefficients makes these surrogates differentiable with respect to their hyperparameters.

Setup:

using Surrogates
using LinearAlgebra 

function branin(x)
    x1 = x[1]
    x2 = x[2]
    b = 5.1 / (4*pi^2);
    c = 5/pi;
    r = 6;
    a = 1;
    s = 10;
    t = 1 / (8*pi);
    term1 = a * (x2 - b*x1^2 + c*x1 - r)^2;
    term2 = s*(1-t)*cos(x1);
    y = term1 + term2 + s;
end

n_samples = 21
lower_bound = [-5.0, 0.0]
upper_bound = [10.0, 15.0]

xys = sample(n_samples, lower_bound, upper_bound, SobolSample())
zs = branin.(xys);

# good hyperparams for the Branin function
p = [2.0, 2.0]
theta = [0.03, 0.003]

kriging_surrogate = Kriging(xys, zs, lower_bound, upper_bound; p, theta)

# Log likelihood function of a Gaussian process model, AKA what we want to optimize in order to tune our hyperparameters
function kriging_logpdf(params, x, y, lb, ub)
    d = length(params) ÷ 2
    theta = params[1:d]
    p = params[d+1:end]
    surr = Kriging(x, y, lb, ub; p, theta)

    n = length(y)
    y_minus_1μ = y - ones(length(y), 1) * surr.mu
    Rinv = surr.inverse_of_R

    term1 = only(-(y_minus_1μ' * Rinv * y_minus_1μ) / 2 / surr.sigma)
    term2 = -log((2π * surr.sigma)^(n/2) / sqrt(det(Rinv)))

    logpdf = term1 + term2
    return logpdf
end

loss_func = params -> -kriging_logpdf(params, xys, zs, lower_bound, upper_bound)
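
For reference, term1 and term2 above together form the Gaussian process log likelihood

$$\log L = -\frac{(y - \mathbf{1}\mu)^{\top} R^{-1} (y - \mathbf{1}\mu)}{2\sigma} - \log\!\left(\frac{(2\pi\sigma)^{n/2}}{\sqrt{\det(R^{-1})}}\right) = -\frac{(y - \mathbf{1}\mu)^{\top} R^{-1} (y - \mathbf{1}\mu)}{2\sigma} - \frac{n}{2}\log(2\pi\sigma) - \frac{1}{2}\log\det R,$$

where $\mu$, $\sigma$, and $R^{-1}$ are the quantities stored on the fitted Kriging surrogate.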

Just to check that this works, here's how the model performs

[plot: Kriging surrogate fit to the Branin function with the hand-picked hyperparameters]

Now let's try taking a gradient with respect to our hyperparameters $p, \theta$

using Zygote

Zygote.gradient(loss_func, [theta; p])

We get an error (ERROR: Mutating arrays is not supported -- called setindex!(Matrix{Float64}, ...)) which comes from the _calc_kriging_coeffs function when we build the covariance matrix R. We can replace the mutating part of this function with a matrix comprehension as follows:

old:

function _calc_kriging_coeffs(x, y, p, theta)
    n = length(x)
    d = length(x[1])
    R = zeros(float(eltype(x[1])), n, n)
    @inbounds for i in 1:n
        for j in 1:n
            sum = zero(eltype(x[1]))
            for l in 1:d
                sum = sum + theta[l] * norm(x[i][l] - x[j][l])^p[l]
            end
            R[i, j] = exp(-sum)
        end
    end

    one = ones(n, 1)
    one_t = one'
    inverse_of_R = inv(R)

    mu = (one_t * inverse_of_R * y) / (one_t * inverse_of_R * one)

    y_minus_1μ = y - one * mu

    b = inverse_of_R * y_minus_1μ

    sigma = (y_minus_1μ' * inverse_of_R * y_minus_1μ) / n

    mu[1], b, sigma[1], inverse_of_R
end

new:

function _calc_kriging_coeffs(x, y, p, theta)
    n = length(x)
    d = length(x[1])
    R = [
        let
            sum = zero(eltype(x[1]))
            for l in 1:d
                sum = sum + theta[l] * norm(x[i][l] - x[j][l])^p[l]
            end
            exp(-sum)
        end

        for j in 1:n, i in 1:n
    ]

    one = ones(n, 1)
    one_t = one'
    inverse_of_R = inv(R)

    mu = (one_t * inverse_of_R * y) / (one_t * inverse_of_R * one)

    y_minus_1μ = y - one * mu

    b = inverse_of_R * y_minus_1μ

    sigma = (y_minus_1μ' * inverse_of_R * y_minus_1μ) / n

    mu[1], b, sigma[1], inverse_of_R
end

With that, we can compute a gradient!

julia> Zygote.gradient(loss_func, [0.03, 0.0034, 2.0, 2.0])
([-654.8265400112359, 1784.234854500508, 5532.336307947245, 332.2643207182241],)
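
As a rough sketch of where this could go (Optim.jl is assumed here and is not a Surrogates dependency), the gradient can be fed straight into a quasi-Newton optimizer to tune the hyperparameters:

using Optim, Zygote

# In-place gradient for Optim, reusing Zygote's gradient of the loss
function loss_grad!(g, params)
    g .= Zygote.gradient(loss_func, params)[1]
end

# Unconstrained illustration starting from the current hyperparameters [theta; p];
# respecting 0 < p <= 2 would need e.g. Fminbox or another bounded method.
result = Optim.optimize(loss_func, loss_grad!, [theta; p], LBFGS())

d = length(lower_bound)
theta_opt = Optim.minimizer(result)[1:d]
p_opt = Optim.minimizer(result)[d+1:end]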

archermarx commented Jul 7, 2022

Some further improvement ideas:

  • Nugget regularization / noise variance: add a small value to the main diagonal so R does not become singular when sample points are close to each other; this would also allow Gaussian process regression instead of only interpolation (a minimal sketch follows after this list).
  • Built-in hyperparameter optimization: either multi-start gradient-based optimization, taking care to stay within the bounds 0 <= p <= 2, or some sort of bounded global optimization.
  • Better default hyperparameters: the current defaults are really bad for most functions. p = 2 * ones(d) gives a smooth squared exponential kernel, versus the current default of ones(d), which provides a linear function, and theta could be estimated from the size of the box and the spread of the data.
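
On the first bullet, a minimal sketch of what nugget regularization could look like (the nugget helper/keyword is hypothetical, not an existing Surrogates option):

using LinearAlgebra

# Hypothetical helper: add a small nugget to the diagonal of the correlation matrix
# so it stays well-conditioned when sample points nearly coincide; with noisy data
# this also turns the interpolator into a regressor.
regularize(R::AbstractMatrix, nugget::Real = 1e-10) = R + nugget * I

# inside _calc_kriging_coeffs, instead of inv(R):
# inverse_of_R = inv(regularize(R, nugget))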

archermarx changed the title from "Making Kriging differentiable" to "Making Kriging differentiable and better initial hyperparams" on Jul 7, 2022
ChrisRackauckas commented

Oh nice, can you open a PR?

Note a similar discussion is occurring in #366

> The default hyperparameters are really bad for most functions. Would recommend providing sensible defaults (p = 2 * ones(d) gives a smooth squared exponential kernel, versus the current default of ones(d), which provides a linear function, and theta could be guesstimated based on the size of the box :)

That sounds like a good idea. A smooth squared exponential kernel is more expected as a default, too.

archermarx commented

Made a PR for the differentiability. For the initial hyperparams, I propose:

$$p_i = 2, i \in [1, d]$$

$$\theta_i = \frac{1}{2}\frac{1}{\mathrm{variance}(\mathbf{X_i})}, i\in [1, d]$$

The latter comes from A Practical Guide to Gaussian Processes and the interpretation $\theta_i = 1/(2\ell_i^2)$, where $\ell_i$ is a characteristic correlation length scale for variable $i$. Scaling the length scale with the standard deviation of the samples makes the hyperparameter initialization independent of the box size, which is a very desirable property. With this initialization, here are some comparisons against the current default initialization:

Sphere function

Old: [plot: sphere_old_defaults]
New: [plot: sphere_new_defaults]

L1-norm function

Old: [plot: l1norm_old_defaults]
New: [plot: l1norm_new_defaults]

Branin function

Old: [plot: branin_old_defaults]
New: [plot: branin_new_defaults]

Rosenbrock function

Old: [plot: rosenbrock_old_defaults]
New: [plot: rosenbrock_new_defaults]

Clearly, the new hyperparameters are a significant improvement.
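
For concreteness, here's roughly how those defaults could be computed from the samples (default_hyperparameters is an illustrative helper, not necessarily what the PR will implement):

using Statistics

# p_i = 2 gives the smooth squared exponential kernel;
# theta_i = 1 / (2 * var(X_i)) sets the length scale to the sample standard deviation.
function default_hyperparameters(xs)
    d = length(first(xs))
    p = fill(2.0, d)
    theta = [0.5 / var(getindex.(xs, i)) for i in 1:d]
    return p, theta
end

p, theta = default_hyperparameters(xys)
kriging_surrogate = Kriging(xys, zs, lower_bound, upper_bound; p, theta)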

I'll make a PR with this, but we'd also want to update all of the documentation examples that use Kriging to the new default hyperparameters, since the current examples don't exactly inspire confidence in the method's performance.

ChrisRackauckas commented

Fixed by #374
