rle-array

Extension Array for Pandas that implements Run-length Encoding.

Table of Contents

Quick Start
Development Plan
Implementation
License

Quick Start

Some basic setup first:

>>> import pandas as pd
>>> pd.set_option("display.max_rows", 40)
>>> pd.set_option("display.width", None)

We need some example data, so let's create some pseudo-weather data:

>>> from rle_array.testing import generate_example
>>> df = generate_example()
>>> df.head(10)
        date  month  year    city    country   avg_temp   rain   mood
0 2000-01-01      1  2000  city_0  country_0  12.400000  False     ok
1 2000-01-02      1  2000  city_0  country_0   4.000000  False     ok
2 2000-01-03      1  2000  city_0  country_0  17.200001  False  great
3 2000-01-04      1  2000  city_0  country_0   8.400000  False     ok
4 2000-01-05      1  2000  city_0  country_0   6.400000  False     ok
5 2000-01-06      1  2000  city_0  country_0  14.400000  False     ok
6 2000-01-07      1  2000  city_0  country_0  14.300000   True     ok
7 2000-01-08      1  2000  city_0  country_0   6.800000  False     ok
8 2000-01-09      1  2000  city_0  country_0  10.100000  False     ok
9 2000-01-10      1  2000  city_0  country_0  -1.200000  False     ok

Due to the large number of attributes for locations and the date, the data size is quite large:

>>> df.memory_usage()
Index            128
date        32000000
month        4000000
year         8000000
city        32000000
country     32000000
avg_temp    16000000
rain         4000000
mood        32000000
dtype: int64
>>> df.memory_usage().sum()
160000128

To compress the data, we can use rle-array:

>>> import rle_array
>>> df_rle = df.astype({
...     "city": "RLEDtype[object]",
...     "country": "RLEDtype[object]",
...     "month": "RLEDtype[int8]",
...     "mood": "RLEDtype[object]",
...     "rain": "RLEDtype[bool]",
...     "year": "RLEDtype[int16]",
... })
>>> df_rle.memory_usage()
Index            128
date        32000000
month        1188000
year          120000
city           32000
country           64
avg_temp    16000000
rain         6489477
mood        17153296
dtype: int64
>>> df_rle.memory_usage().sum()
72982965

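This works better the longer the runs are. In the above example, it works very well for "country" and "city" but poorly for "rain": its runs are so short that the encoded column (6489477 bytes) is larger than the original (4000000 bytes).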
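To illustrate why run length matters, here is a minimal plain-Python sketch. The helpers `count_runs` and `estimated_rle_bytes` are hypothetical and not part of rle-array. The cost model (one int64 end position plus one value per run) is an assumption based on the storage layout described in the Implementation section.

```python
from itertools import groupby

END_POSITION_BYTES = 8  # assumption: one int64 end position is stored per run


def count_runs(values):
    """Count maximal runs of equal consecutive values (hypothetical helper)."""
    return sum(1 for _ in groupby(values))


def estimated_rle_bytes(n_runs, value_itemsize):
    """Assumed cost model: one end position plus one value per run."""
    return n_runs * (END_POSITION_BYTES + value_itemsize)


# A "year"-like column stored as int16 (2 bytes per value): two long runs.
year = [2000] * 366 + [2001] * 365
# A "rain"-like column stored as bool (1 byte per value): the value flips often.
rain = [False, True, False, False, True, False, True, True]

print(count_runs(year), estimated_rle_bytes(count_runs(year), 2), len(year) * 2)
# 2 20 1462
print(count_runs(rain), estimated_rle_bytes(count_runs(rain), 1), len(rain) * 1)
# 6 54 8
```

Under this assumed cost model, a bool column only shrinks when it has fewer than one run per nine rows. The 6489477 bytes reported for "rain" would correspond to 721053 runs of 9 bytes each.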

Development Plan

The development of rle-array has the following priorities (in decreasing order):

1. Correctness: All results must be correct. The Pandas-provided test suite must pass. Approximations are not allowed.
2. Transparency: The user can use :class:`~rle_array.RLEDtype` and :class:`~rle_array.RLEArray` like other Pandas types. No special parameters or extra functions are required.
3. Features: Support all features that Pandas offers, even if it is slow (but inform the user using a :class:`pandas.errors.PerformanceWarning`).
4. Simplicity: Do not use Python C Extensions or Cython (NumPy and Numba are allowed).
5. Memory Reduction: Do not decompress the encoded data when not required; try to do as many calculations as possible directly on the compressed representation.
6. Performance: It should be quick, and for large data ideally faster than working on the uncompressed data. Use Numba to speed up code.

Implementation

Imagine the following data array:

Index	Data
1	"a"
2	"a"
3	"a"
4	"x"
5	"c"
6	"c"
7	"a"
8	"a"

Some values repeat over several consecutive entries. Below, a blank cell means that the value above it continues:

Index	Data
1	"a"
2
3
4	"x"
5	"c"
6
7	"a"
8

These sections are also called runs and can be encoded by their value and their length:

Length	Value
3	"a"
1	"x"
2	"c"
2	"a"
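
The table above can be reproduced with a short plain-Python sketch. The functions `rle_encode` and `rle_decode` are illustrative names only, not the rle-array API, and the library itself works on NumPy arrays rather than lists.

```python
from itertools import groupby


def rle_encode(data):
    """Return (lengths, values), one entry per run of equal consecutive items."""
    lengths, values = [], []
    for value, run in groupby(data):
        lengths.append(sum(1 for _ in run))
        values.append(value)
    return lengths, values


def rle_decode(lengths, values):
    """Expand (lengths, values) back into the original sequence."""
    out = []
    for length, value in zip(lengths, values):
        out.extend([value] * length)
    return out


data = ["a", "a", "a", "x", "c", "c", "a", "a"]
lengths, values = rle_encode(data)
print(lengths)  # [3, 1, 2, 2]
print(values)   # ['a', 'x', 'c', 'a']
```

Decoding the pairs with `rle_decode(lengths, values)` returns the original eight entries, so the encoding is lossless.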

This representation is called Run-length Encoding. To integrate this encoding better with Pandas and NumPy and to support operations like slicing and random access (e.g. via :func:`pandas.api.extensions.ExtensionArray.take`), we store the end position of each run (the cumulative sum of the length column) instead of its length:

End-position	Value
3	"a"
4	"x"
6	"c"
8	"a"

The value array is a :class:`numpy.ndarray` with the same dtype as the original data, and the end positions are a :class:`numpy.ndarray` with the dtype int64.
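
One reason end positions help with random access is that they are sorted, so a binary search finds the run containing a given position. The following plain-Python sketch shows the idea with `bisect`. The functions `get_item` and `take` are hypothetical illustrations, not the library's implementation, and they use zero-based positions, whereas the tables above count from 1.

```python
from bisect import bisect_right
from itertools import accumulate

lengths = [3, 1, 2, 2]
values = ["a", "x", "c", "a"]

# End positions are the cumulative sum of the run lengths.
end_positions = list(accumulate(lengths))
print(end_positions)  # [3, 4, 6, 8]


def get_item(end_positions, values, i):
    """Return the element at zero-based position i without decompressing."""
    if not end_positions or not 0 <= i < end_positions[-1]:
        raise IndexError(i)
    # The run containing i is the first one whose end position is greater than i.
    return values[bisect_right(end_positions, i)]


def take(end_positions, values, indices):
    """Look up several zero-based positions, one binary search per position."""
    return [get_item(end_positions, values, i) for i in indices]


print(take(end_positions, values, [0, 3, 5, 7]))  # ['a', 'x', 'c', 'a']
```

With lengths instead of end positions, each lookup would first have to sum the lengths of all preceding runs.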

License

Licensed under:

MIT License (LICENSE.txt or https://opensource.org/licenses/MIT)
