DataFrame conversion tutorial (#1240)

* data frame conversion vignette
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* data frame conversion tutorial: executable code
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* PyCapsule discussion
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fixup, minor edits
* fixup random versions ci job fail, add pymarginaleffects to readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marco Gorelli <[email protected]>
narwhals-dev · Oct 27, 2024 · e6abf27
1 parent 591992c
commit e6abf27
Showing 4 changed files with 81 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -43,10 +43,11 @@ Join the party!
 
 - [Altair](https://github.com/vega/altair/)
 - [Hamilton](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/narwhals)
+- [marimo](https://github.com/marimo-team/marimo)
+- [pymarginaleffects](https://github.com/vincentarelbundock/pymarginaleffects)
 - [scikit-lego](https://github.com/koaning/scikit-lego)
 - [scikit-playtime](https://github.com/koaning/scikit-playtime)
 - [timebasedcv](https://github.com/FBruzzesi/timebasedcv)
-- [marimo](https://github.com/marimo-team/marimo)
 - [tubular](https://github.com/lvgig/tubular)
 - [wimsey](https://github.com/benrutter/wimsey)
 

diff --git a/docs/basics/dataframe_conversion.md b/docs/basics/dataframe_conversion.md
@@ -0,0 +1,76 @@
+# Conversion between libraries
+
+Some library maintainers need to apply complex dataframe operations using methods and functions that are not (yet) implemented in Narwhals. In such cases, Narwhals can still help by making dataframe conversion easy.
+
+## Dataframe X in, pandas out
+
+Imagine that you maintain a library with a function that operates on pandas dataframes to produce automated reports. You want to allow users to supply a dataframe in any format (pandas, Polars, DuckDB, cuDF, Modin, etc.) to that function, without adding all those dependencies to your own project and without special-casing each input library's variation of `to_pandas` / `toPandas` / `to_pandas_df` / `df` ...
+
+One solution is to use Narwhals as a thin dataframe ingestion layer that converts a user-supplied dataframe to the format your library uses internally. Since Narwhals is zero-dependency, this is much more lightweight than including all the dataframe libraries as dependencies,
+and easier to write than special-casing each input library's `to_pandas` method (if it even exists!).
+
+To illustrate, we create dataframes in various formats:
+
+```python exec="1" source="above" session="conversion"
+import narwhals as nw
+from narwhals.typing import IntoDataFrame
+
+import duckdb
+import polars as pl
+import pandas as pd
+
+df_polars = pl.DataFrame(
+    {
+        "A": [1, 2, 3, 4, 5],
+        "fruits": ["banana", "banana", "apple", "apple", "banana"],
+        "B": [5, 4, 3, 2, 1],
+        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
+    }
+)
+df_pandas = df_polars.to_pandas()
+df_duckdb = duckdb.sql("SELECT * FROM df_polars")
+```
+
+Now, we define a function that can ingest any dataframe type supported by Narwhals, and convert it to a pandas DataFrame for internal use:
+
+```python exec="1" source="above" session="conversion" result="python"
+def df_to_pandas(df: IntoDataFrame) -> pd.DataFrame:
+    return nw.from_native(df).to_pandas()
+
+
+print(df_to_pandas(df_polars))
+```
+
+## Dataframe X in, Polars out
+
+### Via PyCapsule Interface
+
+Similarly, if your library uses Polars internally, you can convert any user-supplied dataframe to Polars format using Narwhals.
+
+```python exec="1" source="above" session="conversion" result="python"
+def df_to_polars(df: IntoDataFrame) -> pl.DataFrame:
+    return nw.from_arrow(nw.from_native(df), native_namespace=pl).to_native()
+
+
+print(df_to_polars(df_duckdb))  # You can only execute this line of code once.
+```
+
+Passing Polars as `native_namespace` works here because Polars supports the [PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for import.
+
+Note that the PyCapsule Interface makes no guarantee that you can export the same object repeatedly, so the approach above only works if you
+only expect to perform the conversion a single time on each input object.
+
+### Via PyArrow
+
+If you need to ingest the same dataframe multiple times, then you may want to go via PyArrow instead.
+This may be less efficient than the PyCapsule approach above (and always requires PyArrow!), but is more forgiving:
+
+```python exec="1" source="above" session="conversion" result="python"
+def df_to_polars(df: IntoDataFrame) -> pl.DataFrame:
+    return pl.DataFrame(nw.from_native(df).to_arrow())
+
+
+df_duckdb = duckdb.sql("SELECT * FROM df_polars")
+print(df_to_polars(df_duckdb))  # We can execute this...
+print(df_to_polars(df_duckdb))  # ...as many times as we like!
+```
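
Note on the two Polars routes above: the trade-off between converting once from a single-use export and materializing first can be modelled without any dataframe library. The sketch below is a pure-Python analogy only and involves no Arrow or Narwhals calls. `OneShotSource`, `MaterializedTable`, `export_stream`, and `convert` are hypothetical names invented for illustration, and the "raise on second export" behaviour is an assumption standing in for a producer that does not support repeated export.

```python
from __future__ import annotations

from typing import Iterator

Row = tuple  # hypothetical row type, for illustration only


class OneShotSource:
    """Hypothetical producer whose data can be exported only once."""

    def __init__(self, rows: list[Row]) -> None:
        self._rows = rows
        self._exported = False

    def export_stream(self) -> Iterator[Row]:
        # Assumption: a second export is an error, mimicking a single-use stream.
        if self._exported:
            raise RuntimeError("stream already consumed")
        self._exported = True
        return iter(self._rows)


class MaterializedTable:
    """Hypothetical in-memory copy that can be exported any number of times."""

    def __init__(self, source: OneShotSource) -> None:
        # Consume the single-use stream exactly once and keep the rows.
        self._rows = list(source.export_stream())

    def export_stream(self) -> Iterator[Row]:
        return iter(self._rows)


def convert(obj: OneShotSource | MaterializedTable) -> list[Row]:
    """Consume whatever stream the object exports and collect its rows."""
    return list(obj.export_stream())
```

In this model, calling `convert` twice on the same `OneShotSource` raises `RuntimeError`, while wrapping it in `MaterializedTable` first pays for one copy and then allows repeated conversion, which loosely mirrors the efficiency-versus-forgiveness trade-off described for the PyArrow route.
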
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -10,6 +10,7 @@ nav:
     - basics/dataframe.md
     - basics/series.md
     - basics/complete_example.md
+    - basics/dataframe_conversion.md
   - Pandas-like concepts:
     - other/pandas_index.md
     - other/user_warning.md

diff --git a/tests/expr_and_series/dt/timestamp_test.py b/tests/expr_and_series/dt/timestamp_test.py
@@ -11,6 +11,7 @@
 
 import narwhals.stable.v1 as nw
 from tests.utils import PANDAS_VERSION
+from tests.utils import POLARS_VERSION
 from tests.utils import PYARROW_VERSION
 from tests.utils import Constructor
 from tests.utils import ConstructorEager
@@ -197,6 +198,7 @@ def test_timestamp_invalid_unit_series(constructor_eager: ConstructorEager) -> N
     starting_time_unit=st.sampled_from(["us", "ns"]),
 )
 @pytest.mark.skipif(PANDAS_VERSION < (2, 2), reason="bug in old pandas")
+@pytest.mark.skipif(POLARS_VERSION < (0, 20, 7), reason="bug in old Polars")
 def test_timestamp_hypothesis(
     inputs: datetime,
     time_unit: Literal["ms", "us", "ns"],