Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Haversine Distance for Geographic Coordinates #108

Open
soupoder opened this issue Jun 4, 2021 · 9 comments
Open

Haversine Distance for Geographic Coordinates #108

soupoder opened this issue Jun 4, 2021 · 9 comments
Labels

Comments

@soupoder
Copy link

soupoder commented Jun 4, 2021

I am wondering if its possible to generate a variogram using haversine/great-circle as the distance metric? This doesn't seem to be supported by scipy.spatial.distance.pdist but would be helpful for working directly with geographic coordinates. Is it possible to supply a user defined function to calculate haversine distance?

@mmaelicke
Copy link
Owner

Hey there,
Thanks for your question.
Well, while haversine is not implemented into scikit-gstat, The Variogram.dist_func argument does take a callable. The function signature has to be like:

def haversine(p1, p2) -> float:
  pass

Where p1, p2 are iterated along the Variogram.coordintes numpy array. Thus, if the points are 2D in this case, p1.shape == (1,2)

or with Rosetta code's solution:

from math import radians, sin, cos, sqrt, asin
 
def haversine(p1, p2):
    # my adaption to fit the function signature  
    lat1, lon1 = p1
    lat2, lon2 = p2   

    # original
    R = 6372.8  # Earth radius in kilometers
 
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
 
    a = sin(dLat / 2)**2 + cos(lat1) * cos(lat2) * sin(dLon / 2)**2
    c = 2 * asin(sqrt(a))
 
    return R * c

For performance reasons, you might want to implement a numpy solution.

Maybe as a bit of background and for others that might come across this issue: The haversine formula is pretty straightforward, but tends to become quite imprecise, especially on small scales, which I am personally usually working on. An alternative to haversine is Vincenty's solution, which is more accurate, as far as I know.
Finally, with great-circle distance, one has to be aware that you should not rely on the distances calculated, what the variogram actually does. Thus, if used, one should note take the computed effective range as a precise estimation of correlation lengths in a strict sense and use the variogram more as a rough estimation to see patterns.

Finally, I am happy to take a pull request if you want to contribute your solution as a default option to scikit-gstat.
Best,

Mirko

@soupoder
Copy link
Author

soupoder commented Jun 8, 2021

Much thanks for your reply Mirko,

I scrapped the following function to use numpy to run the great circle distance calculation to pass as a callable to dist_func:

`def haversine(p1, p2):

lat1, lon1 = p1
lat2, lon2 = p2    

lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

dlon = lon2 - lon1
dlat = lat2 - lat1

a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km`

However it appears to be running quite slowly compared to the standard Euclidean distance (running on ~30k coords). Looking into it, I am wondering if this is because euclidean distance is able to take advantage of a cKDTree, whereas my custom function is doing a brute search? If so I am wondering if there would be any way to use something like sklearn.neighbors.BallTree (which supports haversine distance) to speed up the process?

@mmaelicke
Copy link
Owner

Hey there,

Yes, the Variogram class is using an instance of skgstat.MetricSpace to handle coordinates. In case you need to calculate the same set of coordinates for different Variogram over and over again, it is also possible to first create the MetricSpace and then pass it as the coordinates attribute. This will prevent re-calculations. This first creation will, however, still calculate the full distance matrix and take its time for 30k coords.

I think, generally, it should be possible to use a BallTree, however, just passing it is not possible. We would need to tweak the MetricSpace. The cDKTree is created here:

self._tree = cKDTree(self.coords)

while the class is explicitly checking for euclidean distance here:

if self.dist_metric != "euclidean":

to raise an Error if a MetricSpace.tree is accessed and here:

if self.max_dist is not None and self.dist_metric == "euclidean":

to check whether a maximum distance is set. Otherwise, pdist would be used.

My first thoughts here:

  • verify that BallTree and cDKTree share an interface (only .sparse_distance_matrix is used if I recall correctly). I am not familiar with BallTree... so I can't say if this is even possible
  • The first check could be extended to anything BallTree supports

The last check (line 138) however, would need some adaptions to the logic. Here, the existing check could be kept and a new elif is needed to check for haversine or anything else BallTree supports. The else could then be kept as is.
This would imply that haversine is always calculated with BallTree, thus a quick benchmark of the numpy solution vs BallTree solution would be needed, otherwise, the elif has to include the check for self.max_distas well (line 138)

If you would like to contribute this, I am happy to take any pull request, otherwise, I can try to implement this soon and then a review and test by you would be highly appreciated.
@redhog do you have any thoughts, ideas or opinions on this?

Best
Mirko

@MuellerSeb
Copy link
Collaborator

MuellerSeb commented Jun 8, 2021

My two cents on this topic:
Be careful with directly using the haversine distance in the variogram model. The surface of the sphere is a compact manifold (mathematically speaking) and the resulting model when using the haversine distance is not valid anymore (conditional negative semidefinite - CNSD as phrased by Webster 2007).

To overcome this problem, one could use the associated Yadrenko-model for a given spatial 3D variogram model. This uses the chordal distance derived from the geographical coordinates.

We had several discussions about that in PyKrige and GSTools:

When using haversine in standard models and using them later on for building up a kriging matrix, this could result in undefined behavior since there the "conditional negative semidefinite" part is crucial.

Cheers,
Sebastian

@MuellerSeb
Copy link
Collaborator

MuellerSeb commented Jun 8, 2021

@mmaelicke you could think about implementing a switch to use the assoziated yadrenko model, that is able to use the haversine distance:

Starting with a valid isotropic model in 3D, where the covariance between two points is given as a function of their distance:

cov(x1, x2) = func(dist(x1, x2))

We can construct the related yadrenko model on the sphere, that takes the great-circle distance zeta(x1, x2) with the following modification:

cov_sph(x1, x2) = func(2 * sin(zeta(x1, x2) / 2))

So instead of passing the distance d to the variogram, you should pass 2 * np.sin(d/2) when working with the haversine distance. This could also increase the interoperability between GSTools and scikit-gstat 😉

PS: See the GSTools tutorials for more informations: https://geostat-framework.readthedocs.io/projects/gstools/en/stable/examples/08_geo_coordinates/index.html

@mmaelicke
Copy link
Owner

Hey, @MuellerSeb thanks for all the insights!
I wasn't aware of these implications, that is very good to know. I think I need to look into the paper in some more detail.
Just for clarification: When using haversine, do I need to replace the theoretical variogram model with the Yadrenko model, or does it adapt modeled covariances (so kind of wrapping the used model)?
In any case, I would have to work with covariances, right? Or can I apply it to semi-variances, as well?
Cheers

@MuellerSeb
Copy link
Collaborator

It applies to semi-variances as well. Yadrenko derived a way to generate a huge family of valid models on the sphere by making use of valid models in 3D. Since all models in scikit-gstat seem to be valid in 3D, you can derive the associated yadrenko models for all of them by adopting the passed distance in the variogram function.

You can also think this way: When you have lat-lon coordinates, you can derive the spatial point in 3D on a unit-sphere and then calculate the "tunnel distance":

This is ultimately the same as deriving the chordal distance for the great-circle distance.

@redhog
Copy link
Collaborator

redhog commented Jun 9, 2021

I might be a tiny bit out of my depth here, but if you want to make a variogram of, and later krige, positions in a non-euclidean 2d space that does have a euclidean embedding in 3d, why would you not convert the coordinates to that 3d embedding first, then do all calculations in euclidean 3d, then convert the results back at the end?

The earth isn't a sphere, so the simple geometrical implementations of great circles above aren't gonna give the correct results, and the user might want to be able to chose char datum dependent on their application/locality...

@MuellerSeb
Copy link
Collaborator

That is exactly the reasoning behind the yadrenko models and is also the exact way that GSTools treats lat-lon input. We provide two different dimension attributes: dim for the model dimension (3 here) and field_dim for the actual dimension (2 here)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants