DataFrame conversion tutorial (#1240)
* data frame conversion vignette

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* data frame conversion tutorial: executable code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* PyCapsule discussion

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixup, minor edits

* fixup random versions ci job fail, add pymarginaleffects to readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marco Gorelli <[email protected]>
3 people authored Oct 27, 2024
1 parent 591992c commit e6abf27
Showing 4 changed files with 81 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -43,10 +43,11 @@ Join the party!

- [Altair](https://github.com/vega/altair/)
- [Hamilton](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/narwhals)
- [marimo](https://github.com/marimo-team/marimo)
- [pymarginaleffects](https://github.com/vincentarelbundock/pymarginaleffects)
- [scikit-lego](https://github.com/koaning/scikit-lego)
- [scikit-playtime](https://github.com/koaning/scikit-playtime)
- [timebasedcv](https://github.com/FBruzzesi/timebasedcv)
- [marimo](https://github.com/marimo-team/marimo)
- [tubular](https://github.com/lvgig/tubular)
- [wimsey](https://github.com/benrutter/wimsey)

76 changes: 76 additions & 0 deletions docs/basics/dataframe_conversion.md
@@ -0,0 +1,76 @@
# Conversion between libraries

Some library maintainers need complex dataframe operations that rely on methods and functions not (yet) implemented in Narwhals. Narwhals can still help in such cases, because it makes converting between dataframe libraries easy.

## Dataframe X in, pandas out

Imagine that you maintain a library with a function that operates on pandas dataframes to produce automated reports. You want to allow users to supply a dataframe in any format to that function (pandas, Polars, DuckDB, cuDF, Modin, etc.) without adding all those dependencies to your own project and without special-casing each input library's variation of `to_pandas` / `toPandas` / `to_pandas_df` / `df` ...

One solution is to use Narwhals as a thin dataframe ingestion layer that converts the user-supplied dataframe to the format your library uses internally. Since Narwhals is zero-dependency, this is much more lightweight than including all the dataframe libraries as dependencies,
and easier to write than special-casing each input library's `to_pandas` method (if it even exists!).
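
For contrast, here is a rough sketch of the hand-written dispatch that the paragraph above warns against. The method names come from the list above; the `StandIn*` classes and the `to_pandas_by_hand` helper are hypothetical placeholders (not real library objects) so that the sketch runs without installing any dataframe library.

```python
from typing import Any

# Spellings of "convert to pandas" taken from the list above.
CONVERSION_METHODS = ("to_pandas", "toPandas", "to_pandas_df", "df")


def to_pandas_by_hand(obj: Any) -> Any:
    """Try each known spelling in turn and call the first one that exists."""
    for name in CONVERSION_METHODS:
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    raise TypeError(f"Don't know how to convert {type(obj).__name__} to pandas")


# Hypothetical stand-ins, NOT real library classes.
class StandInPolarsLike:
    def to_pandas(self) -> str:
        return "pandas frame via to_pandas"


class StandInSparkLike:
    def toPandas(self) -> str:
        return "pandas frame via toPandas"


print(to_pandas_by_hand(StandInPolarsLike()))
print(to_pandas_by_hand(StandInSparkLike()))
```

Every new input library means another entry in this list, and an object with none of these methods fails at runtime. Delegating the dispatch to Narwhals avoids maintaining this by hand.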

To illustrate, we create dataframes in various formats:

```python exec="1" source="above" session="conversion"
import narwhals as nw
from narwhals.typing import IntoDataFrame

import duckdb
import polars as pl
import pandas as pd

df_polars = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df_pandas = df_polars.to_pandas()
df_duckdb = duckdb.sql("SELECT * FROM df_polars")
```

Now, we define a function that can ingest any dataframe type supported by Narwhals, and convert it to a pandas DataFrame for internal use:

```python exec="1" source="above" session="conversion" result="python"
def df_to_pandas(df: IntoDataFrame) -> pd.DataFrame:
    return nw.from_native(df).to_pandas()


print(df_to_pandas(df_polars))
```

## Dataframe X in, Polars out

### Via PyCapsule Interface

Similarly, if your library uses Polars internally, you can convert any user-supplied dataframe to Polars format using Narwhals.

```python exec="1" source="above" session="conversion" result="python"
def df_to_polars(df: IntoDataFrame) -> pl.DataFrame:
    return nw.from_arrow(nw.from_native(df), native_namespace=pl).to_native()


print(df_to_polars(df_duckdb)) # You can only execute this line of code once.
```

Passing Polars as `native_namespace` works here because Polars supports importing data through the [PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html).

Note that the PyCapsule Interface does not guarantee that an object can be exported more than once, so the approach above is only safe if you
convert each input object a single time.

### Via PyArrow

If you need to ingest the same dataframe multiple times, then you may want to go via PyArrow instead.
This may be less efficient than the PyCapsule approach above (and always requires PyArrow!), but is more forgiving:

```python exec="1" source="above" session="conversion" result="python"
def df_to_polars(df: IntoDataFrame) -> pl.DataFrame:
    return pl.DataFrame(nw.from_native(df).to_arrow())


df_duckdb = duckdb.sql("SELECT * FROM df_polars")
print(df_to_polars(df_duckdb)) # We can execute this...
print(df_to_polars(df_duckdb)) # ...as many times as we like!
```
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
- basics/dataframe.md
- basics/series.md
- basics/complete_example.md
- basics/dataframe_conversion.md
- Pandas-like concepts:
- other/pandas_index.md
- other/user_warning.md
2 changes: 2 additions & 0 deletions tests/expr_and_series/dt/timestamp_test.py
@@ -11,6 +11,7 @@

import narwhals.stable.v1 as nw
from tests.utils import PANDAS_VERSION
from tests.utils import POLARS_VERSION
from tests.utils import PYARROW_VERSION
from tests.utils import Constructor
from tests.utils import ConstructorEager
@@ -197,6 +198,7 @@ def test_timestamp_invalid_unit_series(constructor_eager: ConstructorEager) -> N
starting_time_unit=st.sampled_from(["us", "ns"]),
)
@pytest.mark.skipif(PANDAS_VERSION < (2, 2), reason="bug in old pandas")
@pytest.mark.skipif(POLARS_VERSION < (0, 20, 7), reason="bug in old Polars")
def test_timestamp_hypothesis(
inputs: datetime,
time_unit: Literal["ms", "us", "ns"],
