Adaptation to sklearn #143

Open · MuellerSeb opened this issue Mar 26, 2020 · 15 comments
MuellerSeb (Member) commented Mar 26, 2020

sklearn provides a lot of functionality that we could use to simplify our code.

Distance matrix calculation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

KDtree:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html

Search radius to optimize moving window kriging (see: #57):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius

Nearest neighbors:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query
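A minimal sketch of how these queries could look (the coordinate arrays are made up, just to show the API):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
coords = rng.random((1000, 2))  # observation coordinates (n_samples, 2)
grid = rng.random((50, 2))      # prediction locations

tree = KDTree(coords)
dist, ind = tree.query(grid, k=10)           # n nearest neighbors
ind_window = tree.query_radius(grid, r=0.1)  # all neighbors in a search radius
```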

Besides that, one can specify the metric to use (see: #120):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

  (columns: identifier | class name | args | distance function)

  • “euclidean” | EuclideanDistance | - | sqrt(sum((x - y)^2))
  • “manhattan” | ManhattanDistance | - | sum(|x - y|)
  • “chebyshev” | ChebyshevDistance | - | max(|x - y|)
  • “minkowski” | MinkowskiDistance | p | sum(|x - y|^p)^(1/p)
  • “wminkowski” | WMinkowskiDistance | p, w | sum(|w * (x - y)|^p)^(1/p)
  • “seuclidean” | SEuclideanDistance | V | sqrt(sum((x - y)^2 / V))
  • “mahalanobis” | MahalanobisDistance | V or VI | sqrt((x - y)' V^-1 (x - y))

Or for geo-coordinates (see: #121):

  • “haversine” | HaversineDistance | 2 arcsin(sqrt(sin^2(0.5 dx) + cos(x1) cos(x2) sin^2(0.5 dy)))
    (np.deg2rad is needed here, since the inputs must be given in radians)
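A rough sketch of how the metric selection could look (the coordinate values are made up for illustration):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import DistanceMetric

# generic metrics, with extra args passed through
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
D = pairwise_distances(X, metric="minkowski", p=1.5)

# haversine expects (lat, lon) in radians, hence np.deg2rad
latlon_deg = np.array([[52.52, 13.40], [48.14, 11.58]])
haversine = DistanceMetric.get_metric("haversine")
D_geo = haversine.pairwise(np.deg2rad(latlon_deg))  # angles on the unit sphere
```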

So this could solve a lot of issues.

rth (Contributor) commented Mar 26, 2020

Yes, I think that would be a good idea. I'm all for better integration with scikit-learn in general (#31).

When improving the API for 2.0, making one that's compatible with scikit-learn, if possible, would be useful: https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator

MuellerSeb (Member Author) commented:

@rth Could you draft a general workflow for the integration with sklearn? I have to learn a bit more about sklearn to get to know the dos and don'ts.

rth (Contributor) commented Mar 26, 2020

Essentially one could start with minimal functionality along the lines of,

from sklearn.base import BaseEstimator

class OrdinaryKriging(BaseEstimator):
    def __init__(self):
        ...  # hyperparameters (variogram model, etc.) go here

    def fit(self, X, y):
        ...

    def predict(self, X, return_std=False):
        ...

where X is an (n_samples, n_features) array (with only n_features = 2 or 3 supported for now).

Looking at the usage of GaussianProcessRegressor in scikit-learn could be a start. Then one would progressively add features, and from time to time run check_estimator to make sure the API is compatible.
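Checking the API compatibility could then be as simple as (assuming the OrdinaryKriging sketch above is fleshed out):

```python
from sklearn.utils.estimator_checks import check_estimator

# raises an informative error for every violated API contract
check_estimator(OrdinaryKriging())
```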

We already have a scikit-learn wrapper class BTW (class Krige(RegressorMixin, BaseEstimator)), but the idea could be to make this more of a default API.

MuellerSeb (Member Author) commented:

So we should have a look at:
https://github.com/scikit-learn-contrib/project-template

rth (Contributor) commented Mar 26, 2020

> So we should have a look at: scikit-learn-contrib/project-template

Worth looking into, but some things there are outdated as well.

MuellerSeb (Member Author) commented:

OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment.
When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.

You once mentioned that anisotropy and rotation could be provided by pipelines (#138).
How does that work?

rth (Contributor) commented Mar 27, 2020

> OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment.
> When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.

Yeah, the implementation could be a single class for kriging. The dimension could just be determined from the X.shape[-1] passed to the fit method. Though having OrdinaryKriging / UniversalKriging aliases still might not hurt (if only so that people find them when searching for those keywords online). One of the current issues is that the classes have too many parameters.

> You once mentioned that anisotropy and rotation could be provided by pipelines (#138). How does that work?

The idea is that if some of those calculations are stand-alone transformations of coordinates (i.e. X of shape (n_samples, n_features), where features = ['x', 'y'], converted to X_transformed with the same shape), then they could go into a stand-alone pre-processing class, and a pipeline could be built using make_pipeline (see the docstring example in the link) from scikit-learn. This makes things more modular. For instance,

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(Preprocessor(), OrdinaryKriging())
pipe.fit(coords, target)
z = pipe.predict(coords_grid)

where internally pipe.fit would first call preprocessor.fit, then preprocessor.transform to get the transformed coordinates, and finally kriging.fit on these coordinates.
Then pipe.predict would first call preprocessor.transform and then kriging.predict on the transformed coordinates.

This logic is built into scikit-learn pipelines, and it's possible to chain multiple processing steps like this, and also to combine estimators from different packages, assuming they follow the scikit-learn API.

I haven't done a detailed analysis to see to what extent this is possible in pykrige. One limitation of scikit-learn pipelines is that the target variable (previously known as z) cannot be transformed in pipeline steps, but there are workarounds if that's really a blocker.
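For instance, one workaround for transforming the target could be scikit-learn's TransformedTargetRegressor (a sketch, reusing coords/target from the pipeline example above):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor

# e.g. krige on log-transformed values and map predictions back
model = TransformedTargetRegressor(
    regressor=OrdinaryKriging(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(coords, target)
z = model.predict(coords_grid)
```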

MuellerSeb (Member Author) commented:

One question I have about the pre-processor: what should be fitted?
The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since that would require fitting a spatial variogram to the data, which is quite hard for a sparse data set).

So we would only use transform with fixed parameters (angles, ratios), right?

rth (Contributor) commented Mar 28, 2020

> One question I have about the pre-processor: what should be fitted?
> The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since that would require fitting a spatial variogram to the data, which is quite hard for a sparse data set).

If fit does nothing, that's fine. But it still needs to be there for API compatibility,

from sklearn.base import BaseEstimator, TransformerMixin

class Preprocessor(BaseEstimator, TransformerMixin):  # or give it a better name
    def __init__(self, parameter="some_default"):
        self.parameter = parameter

    def fit(self, X, y=None):
        """This estimator is stateless, fit does nothing."""
        return self

    def transform(self, X):
        # actual transformation here
        return X_transformed

The base classes are needed for the following:

  • BaseEstimator marks it as compatible with the scikit-learn API and also defines the set_params / get_params methods to set/get the parameters passed in __init__,
  • TransformerMixin marks it as a transformer, and defines,
def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X)

which in this case is equivalent to just transform.
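Inheriting from BaseEstimator then gives the parameter handling for free; for instance (a small usage sketch for the Preprocessor above):

```python
pre = Preprocessor(parameter="some_value")
pre.get_params()                         # {'parameter': 'some_value'}
pre.set_params(parameter="other_value")  # returns the estimator itself
```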

MuellerSeb (Member Author) commented Apr 6, 2020

I was digging through the sklearn code and it seems that you can't use the KDTree directly from Cython (link). Since it would be good to avoid the Python overhead in creating the distance matrix, I would suggest forking the critical parts of sklearn.neighbors (link):

  • _binary_tree.pxi - Base class for the kdtree
  • _kd_tree.pyx - kdtree itself
  • _dist_metrics.pyx - the provided distance metrics
  • _dist_metrics.pxd - header file for the distance metrics
  • _typedefs.pyx - type definitions
  • _typedefs.pxd - header file for type definitions

The only thing that needs to be modified is commenting out `from ..utils import check_array` and all occurrences of check_array in _binary_tree.pxi (4 occurrences), since we can take care of the input ourselves (it needs to be an np.float64 array). Then the 6 files above should be enough. We only need to create a header file _kd_tree.pxd (which is not present in sklearn, hence the effort) to use the KDTree directly from Cython.

Advantages:

  • we don't need to create the full distance matrix in Python (which can be huge for big data sets (~10000 points))
  • when implementing the moving window, we can then simply (see the sketch after the next paragraph):
    1. loop over the points in cython
    2. get all nearest neighbors
    3. create the kriging matrix directly in cython
    4. solve the kriging equation in cython (even parallelizable with prange and OpenMP)

ATM the kriging matrix is created in Python and passed to Cython (when doing the full kriging). With n nearest neighbors, a list of points is passed to Cython to cut out the reduced kriging matrix (again, the full matrix is present).
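For reference, a pure-Python sketch of that moving-window loop (a hypothetical vectorized variogram(h) function stands in for the fitted variogram model; the real loop would live in Cython):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KDTree

def moving_window_kriging(coords, values, grid, radius, variogram):
    """Ordinary kriging with a fixed search radius (reference only)."""
    tree = KDTree(coords)
    neighbors = tree.query_radius(grid, r=radius)
    z = np.empty(len(grid))
    for i, idx in enumerate(neighbors):      # 1. loop over the grid points
        pts = coords[idx]                    # 2. neighbors inside the window
        n = len(idx)
        # 3. local ordinary kriging system (semivariances + unbiasedness row)
        A = np.ones((n + 1, n + 1))
        A[:n, :n] = variogram(cdist(pts, pts))
        A[-1, -1] = 0.0
        b = np.ones(n + 1)
        b[:n] = variogram(cdist(pts, grid[i:i + 1])).ravel()
        w = np.linalg.solve(A, b)            # 4. solve the kriging equations
        z[i] = w[:n] @ values[idx]
    return z
```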

The idea would be to create a companion class in Cython for the future kriging class, which holds all necessary information and does the number-crunching. All we need is the variogram in Cython, which is already present.

We could also cut down the functionality of the KDTree to only provide:

  • query_radius - for moving window
  • query - for n nearest neighbors (which is provided by pykrige ATM)

So we can keep the external code base to a minimum.

@rth, @bsmurphy what do you think?

rth (Contributor) commented Apr 7, 2020

Do you think there is no way of doing this without using KDTree/BallTree from Cython? E.g. doing the computation in small batches, or using sparse distance matrices. See e.g. KNeighborsTransformer and the Approximate nearest neighbors example.

> I would suggest forking the critical parts of sklearn.neighbors

That code is legacy and quite convoluted (cf. e.g. scikit-learn/scikit-learn#4217). I would not recommend forking (and maintaining) it unless there is no way around it.

> is commenting out `from ..utils import check_array` and all occurrences of check_array in _binary_tree.pxi (4 occurrences), since we can take care of the input ourselves (it needs to be an np.float64 array).

There is a SLEP 10 proposal to replace check_array with a _validate_data method in scikit-learn (which can be overridden in contrib projects). It was partially merged (scikit-learn/scikit-learn#16112) and will be available in the next release (in ~1 month or so). If some estimators are not yet done, it should be fairly easy to add them.

MuellerSeb (Member Author) commented:

If we can't use the KDTree from Cython, the loop over the grid points needs to be done in Python, or the full distance matrix needs to be constructed and passed to Cython to do the loop there. In the first case we get a massive overhead in time; in the second case, a massive overhead in memory consumption.

It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.

I'll think about it.

rth (Contributor) commented Apr 7, 2020

> or the full distance matrix needs to be constructed and passed to Cython to do the loop there [...] in the second case, a massive overhead in memory consumption.

Not if we use a sparse distance matrix (one that only stores the K nearest neighbours for each point).
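For example (a sketch; kneighbors_graph returns a scipy CSR matrix):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

coords = np.random.random((10000, 2))
# sparse matrix that only stores the distances to the 10 nearest neighbors
D_sparse = kneighbors_graph(coords, n_neighbors=10, mode="distance")
```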

> It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.

You could try opening an issue about it there, but I do indeed anticipate mostly negative feedback due to maintenance costs.

MuellerSeb (Member Author) commented:

If we use only the K nearest neighbors, we still can't provide a search radius for the moving window. The K nearest neighbors always depend on the density of the input data.

rth (Contributor) commented Apr 7, 2020

> If we use only the K nearest neighbors, we still can't provide a search radius for the moving window. The K nearest neighbors always depend on the density of the input data.

I think it should also be possible to create a sparse distance matrix for neighbors within a given radius: i.e. a KNeighborsTransformer-like class that uses radius_neighbors_graph internally instead of kneighbors_graph from NearestNeighbors. But I guess not both of those at the same time.
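Roughly like this (a sketch; radius_neighbors_graph also returns a scipy CSR matrix):

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

coords = np.random.random((10000, 2))
# sparse matrix with all pairwise distances within the search radius
D_sparse = radius_neighbors_graph(coords, radius=0.05, mode="distance")
```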
