This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Ordinary Kriging Complexity #140

Closed
antunesleo opened this issue Mar 14, 2020 · 5 comments

@antunesleo

What's the ordinary kriging complexity?

I've been searching for the time complexity of the ordinary kriging algorithm but couldn't find it. I found a lot of papers and academic material, but I don't have the background to understand them. Big O notation is something I can actually understand before studying the kriging process in depth!

The complexities of the other kriging types would be useful as well.

I'd appreciate it if anyone can help.

@rth
Contributor

rth commented Mar 14, 2020

Independently of the theoretical complexity, I think the most reliable approach is to measure it. For instance, on randomly distributed data, varying the number of samples and running kriging on a 100x100 grid, the script below produces,

          wall_time (s)       peak_memory (MB)                                                                                                
method     __init__ execute    __init__  execute
n_samples                                       
2000           0.11    1.46        0.09   817.01
4000           0.45    5.66      122.03  1891.70
8000           1.97   26.11      619.51  4228.58

so the time complexity is roughly O(n_samples²) for both the __init__ and the execute methods. The memory usage also seems to be O(n_samples²); however, it can be improved by using a different backend than vectorized in the execute method.

Overall there should be room for improvement I think.

benchmark.py (run with OMP_NUM_THREADS=2 python benchmark.py)

import numpy as np
from pykrige.ok import OrdinaryKriging
from neurtu import Benchmark, delayed

rng = np.random.RandomState(0)

max_samples = 100000

# random sample locations in [0, 10] x [0, 10] with random values
data = 10 * rng.rand(max_samples, 2)
z = rng.rand(max_samples)

# 100 x 100 interpolation grid
gridx = np.linspace(0, 10, 100)
gridy = np.linspace(0, 10, 100)

kwargs = dict(variogram_model='linear', verbose=False, enable_plotting=False)


def bench_cases():
    for n_samples in [2000, 4000, 8000]:
        # benchmark model construction (variogram fitting etc.)
        tags = {'n_samples': n_samples, 'method': '__init__'}
        yield delayed(OrdinaryKriging, tags=tags)(
            data[:n_samples, 0], data[:n_samples, 1], z[:n_samples], **kwargs)

        # benchmark prediction on the grid
        tags = {'n_samples': n_samples, 'method': 'execute'}
        ok = OrdinaryKriging(
            data[:n_samples, 0], data[:n_samples, 1], z[:n_samples], **kwargs)
        yield delayed(ok, tags=tags).execute('grid', gridx, gridy)


res = Benchmark(wall_time=True, peak_memory=True)(bench_cases())
print(res.unstack(-1).round(2))

You can adapt this script to evaluate the complexity with respect to other parameters (e.g. grid size).
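For intuition on where the n² comes from: ordinary kriging has to assemble an n × n matrix of pairwise semivariances between samples before it can solve for weights, which alone costs O(n²) time and memory. A minimal numpy sketch (with a hypothetical linear variogram, not PyKrige's internals):

```python
import numpy as np

def variogram_matrix(x, y, slope=1.0):
    """Pairwise semivariance matrix for a linear variogram gamma(h) = slope * h.
    Building it takes O(n^2) time and n * n * 8 bytes as float64."""
    dx = x[:, None] - x[None, :]   # (n, n) broadcasted coordinate differences
    dy = y[:, None] - y[None, :]
    return slope * np.hypot(dx, dy)

n = 2000
rng = np.random.RandomState(0)
a = variogram_matrix(10 * rng.rand(n), 10 * rng.rand(n))
print(a.shape, a.nbytes / 2**20)   # (2000, 2000), ~30.5 MiB
```

Already at n = 8000 this matrix alone is ~488 MiB, which matches the growth of the __init__ memory column above.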

@antunesleo
Author

Thank you @rth, this script helped a lot! I had an issue printing the output, so I worked around it and my output isn't formatted as nicely as yours. Here is the error:

print(res.unstack(-1).round(2))

Traceback (most recent call last):                                              
  File "benchmark.py", line 32, in <module>
    print(res.unstack(-1).round(2))
AttributeError: 'list' object has no attribute 'unstack'

I did some simple experiments that I would like to share.

TIME COMPLEXITY

Samples time complexity

Checking the time variation from 50 to 6400 samples

{'n_samples': 50, 'method': 'execute', 'wall_time': 0.007776964411759228, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.016293986124992443, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.03170493699999497, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.06674067249991822, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.2111246059998848, 'peak_memory': 64.69921875}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.5659971460004272, 'peak_memory': 152.55078125}
{'n_samples': 3200, 'method': 'execute', 'wall_time': 2.3771720049999203, 'peak_memory': 461.609375}
{'n_samples': 6400, 'method': 'execute', 'wall_time': 13.476873717000217, 'peak_memory': 1122.82421875}

If we plot this in a chart and compare it to a chart of O(n²), we can see that it grows slightly faster than O(n²), but it's almost the same. (This isn't a controlled experiment.)

[charts: O(n²) versus n (samples); time (seconds) versus n (samples)]

So, I think we can say the time complexity in the number of samples is O(n²).
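One way to make that comparison quantitative is a least-squares fit of log(time) against log(n): the slope estimates the exponent. Using the four largest runs above (the smaller ones are dominated by fixed overhead):

```python
import numpy as np

# wall times from the runs above, for n >= 800 samples
n = np.array([800, 1600, 3200, 6400])
t = np.array([0.2111, 0.5660, 2.3772, 13.4769])

# slope of log t vs log n estimates the exponent p in t ~ n**p
slope = np.polyfit(np.log(n), np.log(t), 1)[0]
print(round(slope, 2))  # close to 2, consistent with O(n^2)
```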

Grid time complexity

Checking the time variation from 625 (25 x 25) to 160000 (400 x 400) grid cells

25 x 25 grid (625 cells), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.0023299097547179663, 'peak_memory': 0.0}                                                  
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.003100287200004459, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.005714163655158763, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.010084943200005607, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.02489943083325367, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.0893493009998565, 'peak_memory': 0.0}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.3013877789999242, 'peak_memory': 58.51953125}

50 x 50 grid (2500 cells), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.004198298941178911, 'peak_memory': 0.0}                                                    
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.008267518111122425, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.015475255599994853, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.03131875079998281, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.07004850750013247, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.16749932599987005, 'peak_memory': 0.0}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.5587183059997187, 'peak_memory': 115.80859375}

100 x 100 grid (10000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.01494611433334588, 'peak_memory': 0.0}                                                     
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.02806569128571417, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.05157456866678937, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.10322511200001827, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.25305722599978253, 'peak_memory': 91.58203125}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.5502379879999353, 'peak_memory': 244.7421875}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 1.4694017530000565, 'peak_memory': 648.85546875}

200 x 200 grid (40000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.0507471066666767, 'peak_memory': 0.0}                                                      
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.09184435899987875, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.18824201999996149, 'peak_memory': 100.34375}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.37816304099987974, 'peak_memory': 244.765625}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.8037154940002438, 'peak_memory': 489.2109375}
{'n_samples': 800, 'method': 'execute', 'wall_time': 1.7659289359999093, 'peak_memory': 1255.6328125}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 4.836020948000169, 'peak_memory': 2663.15234375}

400 x 400 grid (160000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.28511154499983604, 'peak_memory':95.1484375}                                                   
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.4336792220001371, 'peak_memory': 267.8125}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.7693347370000083, 'peak_memory': 491.953125}
{'n_samples': 200, 'method': 'execute', 'wall_time': 1.692223333000129, 'peak_memory': 1251.76953125}
{'n_samples': 400, 'method': 'execute', 'wall_time': 3.5027248790001977, 'peak_memory': 2652.76953125}
{'n_samples': 800, 'method': 'execute', 'wall_time': 9.453671161999864, 'peak_memory': 5366.30078125}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 23.468780739000067, 'peak_memory': 10734.34765625}

If we take all the results for 100 samples, for example, and plot them in a chart, we can see linear growth.

[chart: grid cells versus time (seconds)]

Conclusion

If we call the number of samples n and the number of grid cells m, can we say that the time complexity in the samples and the grid is O(mn²)?
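As a quick check of the linear-in-m part, fitting log(time) against log(cells) for the n = 100 rows across the five grids above (a rough sketch, not a controlled measurement):

```python
import numpy as np

# execute wall times for 100 samples on the 25x25 ... 400x400 grids above
cells = np.array([625, 2500, 10000, 40000, 160000])
t = np.array([0.005714, 0.015475, 0.051575, 0.188242, 0.769335])

# slope of log t vs log m estimates the exponent p in t ~ m**p
slope = np.polyfit(np.log(cells), np.log(t), 1)[0]
print(round(slope, 2))  # close to 1, i.e. roughly linear in the number of cells
```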

MEMORY USAGE COMPLEXITY

Memory usage seems to be O(n²) in the samples and O(m) in the number of grid cells too, so O(mn²) overall.
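As a back-of-envelope check, if the vectorized backend materializes full m × n float64 temporaries (an assumption based on rth's comment about the vectorized backend, not verified in the source), the sizes line up:

```python
# rough model: one m x n float64 temporary for 1600 samples on the 400x400 grid
n, m = 1600, 400 * 400
mb = n * m * 8 / 2**20   # bytes -> MiB
print(round(mb, 1))      # 1953.1 MiB per full m x n array
```

The observed ~10.7 GB peak for that run would then correspond to roughly five such temporaries.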

Is this correct?

Thanks for the help!

@rth
Copy link
Contributor

rth commented Mar 14, 2020

AttributeError: 'list' object has no attribute 'unstack'

You need to install pandas; then the output will be a pandas.DataFrame instead of a list of dicts.

Yes, O(m n²) for both time and memory complexity sounds plausible to me.
For memory it should be better, e.g. O(m n), for execute(..., backend='C') with n_closest_points provided; need to check.

Obviously O(n²) is not great, and we should try to reduce it if possible (also see #74)
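For intuition on why n_closest_points helps with memory: instead of one big system over all n samples, each grid point solves a small (k+1) × (k+1) ordinary-kriging system over its k nearest neighbours. A toy sketch with a linear variogram (an illustration of the idea, not PyKrige's actual implementation):

```python
import numpy as np

def moving_window_ok(xs, ys, zs, px, py, n_closest=8, slope=1.0):
    """Toy moving-window ordinary kriging with a linear variogram
    gamma(h) = slope * h. Each prediction point solves only a
    (k+1) x (k+1) system over its k nearest samples, so peak memory
    is O(k^2) per point instead of O(n * m) for the whole grid."""
    preds = np.empty(len(px))
    for i, (qx, qy) in enumerate(zip(px, py)):
        d = np.hypot(xs - qx, ys - qy)
        idx = np.argsort(d)[:n_closest]        # k nearest samples
        k = len(idx)
        # pairwise semivariances between the selected samples
        g = slope * np.hypot(xs[idx, None] - xs[None, idx],
                             ys[idx, None] - ys[None, idx])
        # ordinary-kriging system, bordered by the unbiasedness constraint
        a = np.ones((k + 1, k + 1))
        a[:k, :k] = g
        a[k, k] = 0.0
        b = np.ones(k + 1)
        b[:k] = slope * d[idx]
        w = np.linalg.solve(a, b)
        preds[i] = w[:k] @ zs[idx]             # weighted sum of sample values
    return preds

rng = np.random.RandomState(0)
xs, ys, zs = rng.rand(50), rng.rand(50), rng.rand(50)
# kriging is an exact interpolator: predicting at a sample returns its value
print(moving_window_ok(xs, ys, zs, xs[:1], ys[:1])[0] - zs[0])
```

The trade-off is the per-point neighbour search and the many small solves, which may explain why the C backend above is slower in wall time while using far less memory.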

@antunesleo
Author

I did a couple of tests. With the C backend, memory usage decreases but execution time increases a lot. When using n_closest_points, memory usage was reduced with a similar execution time. Check the results.

NORMAL

          wall_time (s)       peak_memory (MB)
method     __init__ execute    __init__  execute
n_samples                                       
2000           0.28    2.08        0.09   521.36
3000           0.53    5.54       68.65   870.05
4000           1.04   10.48      122.05  1294.18
5000           1.44   16.71      190.70  1693.82
6000           2.04   25.76      312.87  2133.00

BACKEND C

          wall_time (s)       peak_memory (MB)
method     __init__ execute    __init__ execute
n_samples                                      
2000           0.24   18.58        0.09  158.43
3000           0.58   39.12       68.65  284.34
4000           1.03   64.21      122.05  440.90
5000           1.51   97.81      190.70  629.74
6000           2.17  130.98      311.39  847.28

BACKEND C with n_closest_points

                           wall_time (s)       peak_memory (MB)
method                      __init__ execute    __init__ execute
n_samples n_closest_points                                      
2000      133                   0.23    1.41        0.09   61.07
3000      200                   0.52    5.18       34.32  137.27
4000      266                   0.97   10.59      122.05  271.36
5000      333                   1.53   16.28      190.70  438.62
6000      400                   2.05   25.59      308.16  735.38

If we don't provide n_closest_points, does it consider all the samples? Note that my n_closest_points is 12 times smaller than my n_samples. I don't know how this affects the statistical validity of the process, or what the recommended n_closest_points would be for each scenario, but backend='C' with n_closest_points can help a lot for running on larger datasets.

I ran the tests in an Anaconda env because I was having some runtime errors with Cython packages. I would like to run it in a regular virtualenv; do you know which packages are needed to do that?

@MuellerSeb
Member

Pre-built Cython wheels are now provided for v1.5.
We will refactor the whole process in v2. See: #136 #143

@GeoStat-Framework GeoStat-Framework locked and limited conversation to collaborators Jul 12, 2021

