rle-array

Extension Array for Pandas that implements Run-length Encoding.

Some basic setup first:

>>> import pandas as pd
>>> pd.set_option("display.max_rows", 40)
>>> pd.set_option("display.width", None)

We need some example data, so let's create some pseudo-weather data:

>>> from rle_array.testing import generate_example
>>> df = generate_example()
>>> df.head(10)
        date  month  year    city    country   avg_temp   rain   mood
0 2000-01-01      1  2000  city_0  country_0  12.400000  False     ok
1 2000-01-02      1  2000  city_0  country_0   4.000000  False     ok
2 2000-01-03      1  2000  city_0  country_0  17.200001  False  great
3 2000-01-04      1  2000  city_0  country_0   8.400000  False     ok
4 2000-01-05      1  2000  city_0  country_0   6.400000  False     ok
5 2000-01-06      1  2000  city_0  country_0  14.400000  False     ok
6 2000-01-07      1  2000  city_0  country_0  14.300000   True     ok
7 2000-01-08      1  2000  city_0  country_0   6.800000  False     ok
8 2000-01-09      1  2000  city_0  country_0  10.100000  False     ok
9 2000-01-10      1  2000  city_0  country_0  -1.200000  False     ok

Because every row repeats the location and date attributes, the DataFrame is large (about 160 MB):

>>> df.memory_usage()
Index            128
date        32000000
month        4000000
year         8000000
city        32000000
country     32000000
avg_temp    16000000
rain         4000000
mood        32000000
dtype: int64
>>> df.memory_usage().sum()
160000128

To compress the data, we can use rle-array:

>>> import rle_array
>>> df_rle = df.astype({
...     "city": "RLEDtype[object]",
...     "country": "RLEDtype[object]",
...     "month": "RLEDtype[int8]",
...     "mood": "RLEDtype[object]",
...     "rain": "RLEDtype[bool]",
...     "year": "RLEDtype[int16]",
... })
>>> df_rle.memory_usage()
Index            128
date        32000000
month        1188000
year          120000
city           32000
country           64
avg_temp    16000000
rain         6489477
mood        17153296
dtype: int64
>>> df_rle.memory_usage().sum()
72982965

This works better the longer the runs are. In the example above, it works poorly for "rain": the runs are so short that the encoded column (6489477 bytes) is larger than the original (4000000 bytes).
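To judge in advance whether a column is a good candidate, it can help to count its runs. The following is a minimal sketch, not part of rle-array: `run_stats` is a hypothetical helper, and the size estimate assumes each run costs one value plus one 8-byte end position, which matches the storage layout described further down. Under that assumption a boolean column needs 9 bytes per run, so the 6489477 bytes reported for "rain" would correspond to roughly 721053 runs, or an average run length of about 5.5 rows.

```python
import numpy as np
import pandas as pd


def run_stats(series: pd.Series) -> dict:
    """Count runs of consecutive equal values (hypothetical helper, not rle-array API)."""
    values = series.to_numpy()
    n_rows = len(values)
    if n_rows == 0:
        return {"n_rows": 0, "n_runs": 0, "avg_run_length": 0.0, "plain_bytes": 0, "est_rle_bytes": 0}
    # A new run starts wherever a value differs from its predecessor.
    # Note: NaN != NaN, so every NaN counts as its own run in this sketch.
    n_runs = int(np.count_nonzero(values[1:] != values[:-1])) + 1
    itemsize = values.dtype.itemsize
    return {
        "n_rows": n_rows,
        "n_runs": n_runs,
        "avg_run_length": n_rows / n_runs,
        "plain_bytes": n_rows * itemsize,
        # Assumption: one value plus one int64 end position per run.
        "est_rle_bytes": n_runs * (itemsize + 8),
    }


long_runs = pd.Series(["city_0"] * 6 + ["city_1"] * 6)
short_runs = pd.Series([False, True, False, False, True, False])

print(run_stats(long_runs))
print(run_stats(short_runs))
```

In this sketch the boolean series has 5 runs over 6 rows, so the estimated encoded size (45 bytes) exceeds the plain size (6 bytes), which mirrors what happens to "rain" above.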

The development of rle-array has the following priorities (in decreasing order):

  1. Correctness: All results must be correct. The Pandas-provided test suite must pass. Approximations are not allowed.
  2. Transparency: The user can use :class:`~rle_array.RLEDtype` and :class:`~rle_array.RLEArray` like other Pandas types. No special parameters or extra functions are required.
  3. Features: Support all features that Pandas offers, even if it is slow (but inform the user using a :class:`pandas.errors.PerformanceWarning`).
  4. Simplicity: Do not use Python C Extensions or Cython (NumPy and Numba are allowed).
  5. Memory Reduction: Do not decompress the encoded data unless required. Do as many calculations as possible directly on the compressed representation.
  6. Performance: It should be quick and, for large data, ideally faster than working on the uncompressed data. Use Numba to speed up code.
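Priority 3 implies that slow code paths announce themselves through :class:`pandas.errors.PerformanceWarning`. The sketch below shows how a caller might record such warnings or turn them into errors with the standard `warnings` module. `slow_fallback` is a hypothetical stand-in that emits the warning itself; it is not an rle-array function, and which rle-array operations actually warn is not covered here.

```python
import warnings

import pandas as pd


def slow_fallback(values):
    """Hypothetical stand-in for an operation that has to decompress its input."""
    warnings.warn(
        "falling back to a slow, uncompressed code path",
        pd.errors.PerformanceWarning,
    )
    return list(values)


# Record warnings instead of printing them, e.g. inside a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = slow_fallback([1, 1, 2])
categories = [w.category for w in caught]

# Alternatively, treat performance warnings as hard errors.
raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error", category=pd.errors.PerformanceWarning)
    try:
        slow_fallback([1, 1, 2])
    except pd.errors.PerformanceWarning as exc:
        raised = True
        print("rejected slow path:", exc)
```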

Imagine the following data array:

===== ====
Index Data
===== ====
1     "a"
2     "a"
3     "a"
4     "x"
5     "c"
6     "c"
7     "a"
8     "a"
===== ====

Some values repeat for several consecutive entries:

===== ====
Index Data
===== ====
1     "a"
2
3
4     "x"
5     "c"
6
7     "a"
8
===== ====

These sections are also called runs and can be encoded by their value and their length:

====== =====
Length Value
====== =====
3      "a"
1      "x"
2      "c"
2      "a"
====== =====
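The step from the data array to the (length, value) pairs can be sketched with plain NumPy. This is an illustration of the idea only, not the code rle-array uses internally; `rle_encode` is a hypothetical name.

```python
import numpy as np


def rle_encode(data):
    """Return (values, lengths) for the runs in `data` (illustrative sketch)."""
    data = np.asarray(data)
    if data.size == 0:
        return data, np.array([], dtype=np.int64)
    # Positions where an element differs from its predecessor start a new run.
    starts = np.concatenate(([0], np.flatnonzero(data[1:] != data[:-1]) + 1))
    # Each run lasts until the next start; the last run lasts until the end.
    lengths = np.diff(np.concatenate((starts, [data.size])))
    return data[starts], lengths.astype(np.int64)


values, lengths = rle_encode(["a", "a", "a", "x", "c", "c", "a", "a"])
print(values.tolist())   # ['a', 'x', 'c', 'a']
print(lengths.tolist())  # [3, 1, 2, 2]
```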

This representation is called Run-length Encoding. To integrate this encoding better with Pandas and NumPy and to support operations like slicing and random access (e.g. via :func:`pandas.api.extensions.ExtensionArray.take`), we store the end position of each run (the cumulative sum of the length column) instead of its length:

============ =====
End-position Value
============ =====
3            "a"
4            "x"
6            "c"
8            "a"
============ =====

The value array is a :class:`numpy.ndarray` with the same dtype as the original data, and the end positions are stored in a :class:`numpy.ndarray` with dtype int64.
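To see why end positions make random access cheap, consider the following sketch. It assumes zero-based element positions (the tables above count from 1) and uses a binary search over the end positions. `rle_take` and `rle_decode` are hypothetical helper names for illustration; they are not the rle-array API, and the library's actual implementation may differ.

```python
import numpy as np

values = np.array(["a", "x", "c", "a"])
lengths = np.array([3, 1, 2, 2], dtype=np.int64)
ends = np.cumsum(lengths)  # end positions: [3, 4, 6, 8]


def rle_take(values, ends, indices):
    """Look up elements by zero-based position without decompressing."""
    indices = np.asarray(indices, dtype=np.int64)
    # side="right" finds the first run whose end position is strictly
    # greater than the requested index, i.e. the run containing it.
    run_idx = np.searchsorted(ends, indices, side="right")
    return values[run_idx]


def rle_decode(values, ends):
    """Expand the compressed form back into the full array."""
    lengths = np.diff(ends, prepend=0)
    return np.repeat(values, lengths)


print(ends.tolist())                                  # [3, 4, 6, 8]
print(rle_take(values, ends, [0, 3, 5, 7]).tolist())  # ['a', 'x', 'c', 'a']
print(rle_decode(values, ends).tolist())
```

Because the end positions are sorted, each lookup is a binary search over the runs rather than a scan over the rows. With lengths alone, the cumulative sum would have to be recomputed for every access.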

Licensed under: