This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Ordinary Kriging Complexity #140

Closed
antunesleo opened this issue Mar 14, 2020 · 5 comments

@antunesleo

What's the ordinary kriging complexity?

I've been searching for the time complexity of the ordinary kriging algorithm but couldn't find it. I found a lot of papers and academic material, but I don't have the background to understand them. Big O notation is something I can actually understand before studying the kriging process in depth!

The complexities of the other kriging types would be useful as well.

I'd appreciate it if anyone can help.

@rth
Contributor

rth commented Mar 14, 2020

Independently of the theoretical complexity, I think the most reliable approach is to measure it. For instance, on randomly distributed data, varying the number of samples and running kriging on a 100x100 grid, the script below produces,

          wall_time (s)       peak_memory (MB)                                                                                                
method     __init__ execute    __init__  execute
n_samples                                       
2000           0.11    1.46        0.09   817.01
4000           0.45    5.66      122.03  1891.70
8000           1.97   26.11      619.51  4228.58

so the time complexity is roughly O(n_samples²) for both the __init__ and the execute methods. The memory usage also seems to be O(n_samples²); however, it can be improved by using a different backend than vectorized in the execute method.

Overall there should be room for improvement I think.

benchmark.py (run with OMP_NUM_THREADS=2 python benchmark.py)

import numpy as np
from pykrige.ok import OrdinaryKriging
from neurtu import Benchmark, delayed

rng = np.random.RandomState(0)

max_samples = 100000

# random sample locations in [0, 10] x [0, 10] with random values
data = 10 * rng.rand(max_samples, 2)
z = rng.rand(max_samples)

# 100 x 100 interpolation grid
gridx = np.linspace(0, 10, 100)
gridy = np.linspace(0, 10, 100)

kwargs = dict(variogram_model='linear', verbose=False, enable_plotting=False)


def bench_cases():
    for n_samples in [2000, 4000, 8000]:
        # benchmark model construction (variogram fitting etc.)
        tags = {'n_samples': n_samples, 'method': '__init__'}
        yield delayed(OrdinaryKriging, tags=tags)(
            data[:n_samples, 0], data[:n_samples, 1], z[:n_samples], **kwargs)

        # benchmark prediction on the grid
        tags = {'n_samples': n_samples, 'method': 'execute'}
        ok = OrdinaryKriging(
            data[:n_samples, 0], data[:n_samples, 1], z[:n_samples], **kwargs)
        yield delayed(ok, tags=tags).execute('grid', gridx, gridy)


res = Benchmark(wall_time=True, peak_memory=True)(bench_cases())
print(res.unstack(-1).round(2))

You can adapt this script to evaluate the complexity with respect to other parameters (e.g. grid size).
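For intuition on where the n² comes from: ordinary kriging has to assemble an n × n matrix of pairwise semivariances between samples before it can solve for weights, which alone costs O(n²) time and memory. A minimal numpy sketch (with a hypothetical linear variogram, not PyKrige's internals):

```python
import numpy as np

def variogram_matrix(x, y, slope=1.0):
    """Pairwise semivariance matrix for a linear variogram gamma(h) = slope * h.
    Building it takes O(n^2) time and n * n * 8 bytes as float64."""
    dx = x[:, None] - x[None, :]   # (n, n) broadcasted coordinate differences
    dy = y[:, None] - y[None, :]
    return slope * np.hypot(dx, dy)

n = 2000
rng = np.random.RandomState(0)
a = variogram_matrix(10 * rng.rand(n), 10 * rng.rand(n))
print(a.shape, a.nbytes / 2**20)   # (2000, 2000), ~30.5 MiB
```

Already at n = 8000 this matrix alone is ~488 MiB, which matches the growth of the __init__ memory column above.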

@antunesleo
Author

Thank you @rth, this script helped a lot! I had an issue printing the output, so I worked around it and my output isn't formatted as nicely as yours. Here is the error:

print(res.unstack(-1).round(2))

Traceback (most recent call last):                                              
  File "benchmark.py", line 32, in <module>
    print(res.unstack(-1).round(2))
AttributeError: 'list' object has no attribute 'unstack'

I did some simple experiments that I would like to share.

TIME COMPLEXITY

Samples time complexity

Checking the time variation from 50 to 6400 samples

{'n_samples': 50, 'method': 'execute', 'wall_time': 0.007776964411759228, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.016293986124992443, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.03170493699999497, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.06674067249991822, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.2111246059998848, 'peak_memory': 64.69921875}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.5659971460004272, 'peak_memory': 152.55078125}
{'n_samples': 3200, 'method': 'execute', 'wall_time': 2.3771720049999203, 'peak_memory': 461.609375}
{'n_samples': 6400, 'method': 'execute', 'wall_time': 13.476873717000217, 'peak_memory': 1122.82421875}

If we plot this in a chart and compare it to a chart of O(n²), we can see that it grows slightly faster than O(n²), but it's almost the same. (This isn't a controlled experiment.)

[charts: O(n²) versus n (samples); time (seconds) versus n (samples)]

So, I think we can say the time complexity in the number of samples is O(n²).
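One way to make that comparison quantitative is a least-squares fit of log(time) against log(n): the slope estimates the exponent. Using the four largest runs above (the smaller ones are dominated by fixed overhead):

```python
import numpy as np

# wall times from the runs above, for n >= 800 samples
n = np.array([800, 1600, 3200, 6400])
t = np.array([0.2111, 0.5660, 2.3772, 13.4769])

# slope of log t vs log n estimates the exponent p in t ~ n**p
slope = np.polyfit(np.log(n), np.log(t), 1)[0]
print(round(slope, 2))  # close to 2, consistent with O(n^2)
```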

Grid time complexity

Checking the time variation from 625 (25 x 25) to 160000 (400 x 400) grid cells

25 x 25 grid (625 cells), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.0023299097547179663, 'peak_memory': 0.0}                                                  
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.003100287200004459, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.005714163655158763, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.010084943200005607, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.02489943083325367, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.0893493009998565, 'peak_memory': 0.0}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.3013877789999242, 'peak_memory': 58.51953125}

50 x 50 grid (2500 cells), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.004198298941178911, 'peak_memory': 0.0}                                                    
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.008267518111122425, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.015475255599994853, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.03131875079998281, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.07004850750013247, 'peak_memory': 0.0}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.16749932599987005, 'peak_memory': 0.0}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 0.5587183059997187, 'peak_memory': 115.80859375}

100 x 100 grid (10000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.01494611433334588, 'peak_memory': 0.0}                                                     
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.02806569128571417, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.05157456866678937, 'peak_memory': 0.0}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.10322511200001827, 'peak_memory': 0.0}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.25305722599978253, 'peak_memory': 91.58203125}
{'n_samples': 800, 'method': 'execute', 'wall_time': 0.5502379879999353, 'peak_memory': 244.7421875}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 1.4694017530000565, 'peak_memory': 648.85546875}

200 x 200 grid (40000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.0507471066666767, 'peak_memory': 0.0}                                                      
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.09184435899987875, 'peak_memory': 0.0}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.18824201999996149, 'peak_memory': 100.34375}
{'n_samples': 200, 'method': 'execute', 'wall_time': 0.37816304099987974, 'peak_memory': 244.765625}
{'n_samples': 400, 'method': 'execute', 'wall_time': 0.8037154940002438, 'peak_memory': 489.2109375}
{'n_samples': 800, 'method': 'execute', 'wall_time': 1.7659289359999093, 'peak_memory': 1255.6328125}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 4.836020948000169, 'peak_memory': 2663.15234375}

400 x 400 grid (160000 cells, 4x bigger than the previous one), 25 to 1600 samples

{'n_samples': 25, 'method': 'execute', 'wall_time': 0.28511154499983604, 'peak_memory':95.1484375}                                                   
{'n_samples': 50, 'method': 'execute', 'wall_time': 0.4336792220001371, 'peak_memory': 267.8125}
{'n_samples': 100, 'method': 'execute', 'wall_time': 0.7693347370000083, 'peak_memory': 491.953125}
{'n_samples': 200, 'method': 'execute', 'wall_time': 1.692223333000129, 'peak_memory': 1251.76953125}
{'n_samples': 400, 'method': 'execute', 'wall_time': 3.5027248790001977, 'peak_memory': 2652.76953125}
{'n_samples': 800, 'method': 'execute', 'wall_time': 9.453671161999864, 'peak_memory': 5366.30078125}
{'n_samples': 1600, 'method': 'execute', 'wall_time': 23.468780739000067, 'peak_memory': 10734.34765625}

If we take all the results for 100 samples, for example, and plot them in a chart, we can see linear growth.

[chart: grid cells versus time (seconds)]

Conclusion

If we call the number of samples n and the number of grid cells m, can we say that the time complexity in the samples and the grid is O(mn²)?
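As a quick check of the linear-in-m part, fitting log(time) against log(cells) for the n = 100 rows across the five grids above (a rough sketch, not a controlled measurement):

```python
import numpy as np

# execute wall times for 100 samples on the 25x25 ... 400x400 grids above
cells = np.array([625, 2500, 10000, 40000, 160000])
t = np.array([0.005714, 0.015475, 0.051575, 0.188242, 0.769335])

# slope of log t vs log m estimates the exponent p in t ~ m**p
slope = np.polyfit(np.log(cells), np.log(t), 1)[0]
print(round(slope, 2))  # close to 1, i.e. roughly linear in the number of cells
```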

MEMORY USAGE COMPLEXITY

Memory usage seems to be O(n²) in the samples and O(m) in the number of grid cells too, so O(mn²) overall.
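As a back-of-envelope check, if the vectorized backend materializes full m × n float64 temporaries (an assumption based on rth's comment about the vectorized backend, not verified in the source), the sizes line up:

```python
# rough model: one m x n float64 temporary for 1600 samples on the 400x400 grid
n, m = 1600, 400 * 400
mb = n * m * 8 / 2**20   # bytes -> MiB
print(round(mb, 1))      # 1953.1 MiB per full m x n array
```

The observed ~10.7 GB peak for that run would then correspond to roughly five such temporaries.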

Is this correct?

Thanks for the help!

@rth
Copy link
Contributor

rth commented Mar 14, 2020

AttributeError: 'list' object has no attribute 'unstack'

You need to install pandas; then the output will be a pandas.DataFrame instead of a list of dicts.

Yes, O(m n²) for both time and memory complexity sounds plausible to me.
For memory it should be better, e.g. O(m n), for execute(..., backend='C') with n_closest_points provided; need to check.

Obviously O(n²) is not great, and we should try to reduce it if possible (also see #74)
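For intuition on why n_closest_points helps with memory: instead of one big system over all n samples, each grid point solves a small (k+1) × (k+1) ordinary-kriging system over its k nearest neighbours. A toy sketch with a linear variogram (an illustration of the idea, not PyKrige's actual implementation):

```python
import numpy as np

def moving_window_ok(xs, ys, zs, px, py, n_closest=8, slope=1.0):
    """Toy moving-window ordinary kriging with a linear variogram
    gamma(h) = slope * h. Each prediction point solves only a
    (k+1) x (k+1) system over its k nearest samples, so peak memory
    is O(k^2) per point instead of O(n * m) for the whole grid."""
    preds = np.empty(len(px))
    for i, (qx, qy) in enumerate(zip(px, py)):
        d = np.hypot(xs - qx, ys - qy)
        idx = np.argsort(d)[:n_closest]        # k nearest samples
        k = len(idx)
        # pairwise semivariances between the selected samples
        g = slope * np.hypot(xs[idx, None] - xs[None, idx],
                             ys[idx, None] - ys[None, idx])
        # ordinary-kriging system, bordered by the unbiasedness constraint
        a = np.ones((k + 1, k + 1))
        a[:k, :k] = g
        a[k, k] = 0.0
        b = np.ones(k + 1)
        b[:k] = slope * d[idx]
        w = np.linalg.solve(a, b)
        preds[i] = w[:k] @ zs[idx]             # weighted sum of sample values
    return preds

rng = np.random.RandomState(0)
xs, ys, zs = rng.rand(50), rng.rand(50), rng.rand(50)
# kriging is an exact interpolator: predicting at a sample returns its value
print(moving_window_ok(xs, ys, zs, xs[:1], ys[:1])[0] - zs[0])
```

The trade-off is the per-point neighbour search and the many small solves, which may explain why the C backend above is slower in wall time while using far less memory.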

@antunesleo
Author

I did a couple of tests. With the C backend, memory usage decreases but execution time increases a lot. When using n_closest_points, memory usage was reduced with a similar execution time. Check the results.

NORMAL

          wall_time (s)       peak_memory (MB)
method     __init__ execute    __init__  execute
n_samples                                       
2000           0.28    2.08        0.09   521.36
3000           0.53    5.54       68.65   870.05
4000           1.04   10.48      122.05  1294.18
5000           1.44   16.71      190.70  1693.82
6000           2.04   25.76      312.87  2133.00

BACKEND C

          wall_time (s)       peak_memory (MB)
method     __init__ execute    __init__ execute
n_samples                                      
2000           0.24   18.58        0.09  158.43
3000           0.58   39.12       68.65  284.34
4000           1.03   64.21      122.05  440.90
5000           1.51   97.81      190.70  629.74
6000           2.17  130.98      311.39  847.28

BACKEND C with n_closest_points

                           wall_time (s)       peak_memory (MB)
method                      __init__ execute    __init__ execute
n_samples n_closest_points                                      
2000      133                   0.23    1.41        0.09   61.07
3000      200                   0.52    5.18       34.32  137.27
4000      266                   0.97   10.59      122.05  271.36
5000      333                   1.53   16.28      190.70  438.62
6000      400                   2.05   25.59      308.16  735.38

If we don't provide n_closest_points, does it consider all the samples? Note that my n_closest_points is 12 times smaller than my n_samples. I don't know how this affects the statistical validity of the process, or what the recommended n_closest_points would be for each scenario, but backend='C' with n_closest_points can help a lot for running on larger datasets.

I ran the tests in an Anaconda env because I was having some runtime errors with Cython packages. I would like to run it in a regular virtualenv; do you know which packages are needed to do that?

@MuellerSeb
Member

Pre-built Cython wheels are now provided for v1.5.
We will refactor the whole process in v2. See: #136 #143

@GeoStat-Framework GeoStat-Framework locked and limited conversation to collaborators Jul 12, 2021

