Merge pull request #41 from abstractqqq/various_adhoc
Various adhoc
abstractqqq authored Dec 27, 2023
2 parents f8c466b + 47af227 commit b7961b7
Showing 18 changed files with 985 additions and 901 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,9 +1,14 @@
.ipynb_checkpoints

# Local, quick adhoc test only purpose
tests/*.ipynb
tests/test.ipynb

/target

# Mkdocs
site/

# Ruff
.ruff_cache/

75 changes: 59 additions & 16 deletions README.md
@@ -18,33 +18,76 @@ pip install polars_ds
and

```python
import polars_ds
import polars_ds as pld
```
when you want to use the namespaces provided by the package.

## Examples

Generating random numbers, and running t-test, normality test inside a dataframe
In-dataframe statistical testing
```python
df.with_columns(
pl.col("a").stats.sample_normal(mean = 0.5, std = 1.).alias("test1")
, pl.col("a").stats.sample_normal(mean = 0.5, std = 2.).alias("test2")
).select(
pl.col("test1").stats.ttest_ind(pl.col("test2"), equal_var = False).alias("t-test")
, pl.col("test1").stats.normal_test().alias("normality_test")
).select(
pl.col("t-test").struct.field("statistic").alias("t-tests: statistics")
, pl.col("t-test").struct.field("pvalue").alias("t-tests: pvalue")
, pl.col("normality_test").struct.field("statistic").alias("normality_test: statistics")
, pl.col("normality_test").struct.field("pvalue").alias("normality_test: pvalue")
df.select(
pl.col("group1").stats.ttest_ind(pl.col("group2"), equal_var = True).alias("t-test"),
pl.col("category_1").stats.chi2(pl.col("category_2")).alias("chi2-test"),
pl.col("category_1").stats.f_test(pl.col("group1")).alias("f-test")
)

shape: (1, 3)
┌───────────────────┬──────────────────────┬────────────────────┐
│ t-test            ┆ chi2-test            ┆ f-test             │
│ ---               ┆ ---                  ┆ ---                │
│ struct[2]         ┆ struct[2]            ┆ struct[2]          │
╞═══════════════════╪══════════════════════╪════════════════════╡
│ {-0.004,0.996809} ┆ {37.823816,0.386001} ┆ {1.354524,0.24719} │
└───────────────────┴──────────────────────┴────────────────────┘
```
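
The `t-test` struct above carries the fields `statistic` and `pvalue`. As a plain-Python sketch of what the equal-variance (pooled) two-sample t statistic computes (illustrative only; the plugin's actual implementation is in Rust):

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    # Student's two-sample t statistic with pooled (equal) variance.
    # variance() is the sample variance (ddof = 1), matching the classic formula.
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

t = pooled_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

The `pvalue` field would additionally require the CDF of the t distribution, which is not shown here.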

Blazingly fast string similarity comparisons. (Thanks to [RapidFuzz](https://docs.rs/rapidfuzz/latest/rapidfuzz/))
Generating random numbers according to reference column
```python
df2.select(
pl.col("word").str2.levenshtein("world", return_sim = True)
df.with_columns(
# Sample from a normal distribution, using reference column "a"'s mean and std
pl.col("a").stats.sample_normal().alias("test1")
# Sample from a uniform distribution with low = 0 and high = "a"'s max, respecting nulls in "a"
, pl.col("a").stats.sample_uniform(low = 0., high = None, respect_null=True).alias("test2")
).head()

shape: (5, 3)
┌───────────┬───────────┬──────────┐
│ a         ┆ test1     ┆ test2    │
│ ---       ┆ ---       ┆ ---      │
│ f64       ┆ f64       ┆ f64      │
╞═══════════╪═══════════╪══════════╡
│ null      ┆ 0.459357  ┆ null     │
│ null      ┆ 0.038007  ┆ null     │
│ -0.826518 ┆ 0.241963  ┆ 0.968385 │
│ 0.737955  ┆ -0.819475 ┆ 2.429615 │
│ 1.10397   ┆ -0.684289 ┆ 2.483368 │
└───────────┴───────────┴──────────┘
```
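
A rough pure-Python sketch of the sampling semantics above (hypothetical helper names; the real work happens in the plugin's Rust kernel): draw using the reference column's statistics, and optionally keep nulls where the reference is null.

```python
import random
from statistics import mean, stdev

def sample_normal_like(ref, seed=42):
    # One normal draw per row, parameterized by the mean/std of the
    # non-null values in the reference column (nulls are ignored when fitting).
    rng = random.Random(seed)
    vals = [v for v in ref if v is not None]
    mu, sigma = mean(vals), stdev(vals)
    return [rng.gauss(mu, sigma) for _ in ref]

def sample_uniform_like(ref, low=0.0, seed=42):
    # Uniform draws in [low, max(ref)]; rows that are null in the
    # reference stay null, mimicking respect_null=True.
    rng = random.Random(seed)
    high = max(v for v in ref if v is not None)
    return [None if v is None else rng.uniform(low, high) for v in ref]

col_a = [None, None, -0.826518, 0.737955, 1.10397]
test1 = sample_normal_like(col_a)
test2 = sample_uniform_like(col_a)
```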

Blazingly fast string similarity comparisons. (Thanks to [RapidFuzz](https://docs.rs/rapidfuzz/latest/rapidfuzz/))
```python
df.select(
pl.col("word").str2.levenshtein("asasasa", return_sim=True).alias("asasasa"),
pl.col("word").str2.levenshtein("sasaaasss", return_sim=True).alias("sasaaasss"),
pl.col("word").str2.levenshtein("asdasadadfa", return_sim=True).alias("asdasadadfa"),
pl.col("word").str2.fuzz("apples").alias("LCS based Fuzz match - apples"),
pl.col("word").str2.osa("apples", return_sim = True).alias("Optimal String Alignment - apples"),
pl.col("word").str2.jw("apples").alias("Jaro-Winkler - apples"),
)
shape: (5, 6)
┌──────────┬───────────┬─────────────┬────────────────┬───────────────────────────┬────────────────┐
│ asasasa  ┆ sasaaasss ┆ asdasadadfa ┆ LCS based Fuzz ┆ Optimal String Alignment  ┆ Jaro-Winkler - │
│ ---      ┆ ---       ┆ ---         ┆ match - apples ┆ - apple…                  ┆ apples         │
│ f64      ┆ f64       ┆ f64         ┆ ---            ┆ ---                       ┆ ---            │
│          ┆           ┆             ┆ f64            ┆ f64                       ┆ f64            │
╞══════════╪═══════════╪═════════════╪════════════════╪═══════════════════════════╪════════════════╡
│ 0.142857 ┆ 0.111111  ┆ 0.090909    ┆ 0.833333       ┆ 0.833333                  ┆ 0.966667       │
│ 0.428571 ┆ 0.333333  ┆ 0.272727    ┆ 0.166667       ┆ 0.0                       ┆ 0.444444       │
│ 0.111111 ┆ 0.111111  ┆ 0.090909    ┆ 0.555556       ┆ 0.444444                  ┆ 0.5            │
│ 0.875    ┆ 0.666667  ┆ 0.545455    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
│ 0.75     ┆ 0.777778  ┆ 0.454545    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
└──────────┴───────────┴─────────────┴────────────────┴───────────────────────────┴────────────────┘
```
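
For reference, the normalized Levenshtein similarity in the first three columns is `1 - distance / max(len(a), len(b))`. A small pure-Python sketch (the plugin delegates the real computation to RapidFuzz):

```python
def levenshtein_sim(a: str, b: str) -> float:
    # Classic dynamic-programming edit distance, then normalize to [0, 1].
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

sim = levenshtein_sim("kitten", "sitting")  # edit distance 3, max length 7
```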

Even in-dataframe nearest neighbors queries! 😲
2 changes: 1 addition & 1 deletion docs/complex.md
@@ -1,3 +1,3 @@
## Extension for Complex Numbers

::: polars_ds.complex
::: polars_ds.complex.ComplexExt
121 changes: 101 additions & 20 deletions docs/index.md
@@ -1,8 +1,12 @@
# Polars-ds
# Polars Extension for General Data Science Use

A Polars Plugin aiming to simplify common numerical/string data analysis procedures. This means that the most basic data science, stats, NLP related tasks can be done natively inside a dataframe, without leaving dataframe world. This also means that for simple data pipelines, you do not need to install NumPy/Scipy/Scikit-learn, which saves a lot of space, which is great under constrained resources.
A Polars Plugin aiming to simplify common numerical/string data analysis procedures. This means that the most basic data science, stats, NLP related tasks can be done natively inside a dataframe, thus minimizing the number of dependencies.

Its goal is NOT to replace SciPy, or NumPy, but rather it tries to reduce dependency for simple analysis, and tries to reduce Python side code and UDFs, which are often performance bottlenecks.
Its goal is not to replace SciPy or NumPy; rather, it aims to improve runtime for common tasks and to reduce Python-side code and UDFs.

See examples [here](https://github.com/abstractqqq/polars_ds_extension/blob/main/examples/basics.ipynb).

**Currently in Beta. Feel free to submit feature requests in the issues section of the repo.**

## Getting Started
```bash
@@ -12,33 +16,110 @@ pip install polars_ds
and

```python
import polars_ds
import polars_ds as pld
```
when you want to use the namespaces provided by the package.

## Examples

Generating random numbers, and running t-test, normality test inside a dataframe
In-dataframe statistical testing
```python
df.with_columns(
pl.col("a").stats_ext.sample_normal(mean = 0.5, std = 1.).alias("test1")
, pl.col("a").stats_ext.sample_normal(mean = 0.5, std = 2.).alias("test2")
).select(
pl.col("test1").stats_ext.ttest_ind(pl.col("test2"), equal_var = False).alias("t-test")
, pl.col("test1").stats_ext.normal_test().alias("normality_test")
).select(
pl.col("t-test").struct.field("statistic").alias("t-tests: statistics")
, pl.col("t-test").struct.field("pvalue").alias("t-tests: pvalue")
, pl.col("normality_test").struct.field("statistic").alias("normality_test: statistics")
, pl.col("normality_test").struct.field("pvalue").alias("normality_test: pvalue")
df.select(
pl.col("group1").stats.ttest_ind(pl.col("group2"), equal_var = True).alias("t-test"),
pl.col("category_1").stats.chi2(pl.col("category_2")).alias("chi2-test"),
pl.col("category_1").stats.f_test(pl.col("group1")).alias("f-test")
)

shape: (1, 3)
┌───────────────────┬──────────────────────┬────────────────────┐
│ t-test            ┆ chi2-test            ┆ f-test             │
│ ---               ┆ ---                  ┆ ---                │
│ struct[2]         ┆ struct[2]            ┆ struct[2]          │
╞═══════════════════╪══════════════════════╪════════════════════╡
│ {-0.004,0.996809} ┆ {37.823816,0.386001} ┆ {1.354524,0.24719} │
└───────────────────┴──────────────────────┴────────────────────┘
```

Blazingly fast string similarity comparisons. (Thanks to [RapidFuzz](https://docs.rs/rapidfuzz/latest/rapidfuzz/))
Generating random numbers according to reference column
```python
df2.select(
pl.col("word").str_ext.levenshtein("world", return_sim = True)
df.with_columns(
# Sample from a normal distribution, using reference column "a"'s mean and std
pl.col("a").stats.sample_normal().alias("test1")
# Sample from a uniform distribution with low = 0 and high = "a"'s max, respecting nulls in "a"
, pl.col("a").stats.sample_uniform(low = 0., high = None, respect_null=True).alias("test2")
).head()

shape: (5, 3)
┌───────────┬───────────┬──────────┐
│ a         ┆ test1     ┆ test2    │
│ ---       ┆ ---       ┆ ---      │
│ f64       ┆ f64       ┆ f64      │
╞═══════════╪═══════════╪══════════╡
│ null      ┆ 0.459357  ┆ null     │
│ null      ┆ 0.038007  ┆ null     │
│ -0.826518 ┆ 0.241963  ┆ 0.968385 │
│ 0.737955  ┆ -0.819475 ┆ 2.429615 │
│ 1.10397   ┆ -0.684289 ┆ 2.483368 │
└───────────┴───────────┴──────────┘
```

Blazingly fast string similarity comparisons. (Thanks to [RapidFuzz](https://docs.rs/rapidfuzz/latest/rapidfuzz/))
```python
df.select(
pl.col("word").str2.levenshtein("asasasa", return_sim=True).alias("asasasa"),
pl.col("word").str2.levenshtein("sasaaasss", return_sim=True).alias("sasaaasss"),
pl.col("word").str2.levenshtein("asdasadadfa", return_sim=True).alias("asdasadadfa"),
pl.col("word").str2.fuzz("apples").alias("LCS based Fuzz match - apples"),
pl.col("word").str2.osa("apples", return_sim = True).alias("Optimal String Alignment - apples"),
pl.col("word").str2.jw("apples").alias("Jaro-Winkler - apples"),
)
shape: (5, 6)
┌──────────┬───────────┬─────────────┬────────────────┬───────────────────────────┬────────────────┐
│ asasasa  ┆ sasaaasss ┆ asdasadadfa ┆ LCS based Fuzz ┆ Optimal String Alignment  ┆ Jaro-Winkler - │
│ ---      ┆ ---       ┆ ---         ┆ match - apples ┆ - apple…                  ┆ apples         │
│ f64      ┆ f64       ┆ f64         ┆ ---            ┆ ---                       ┆ ---            │
│          ┆           ┆             ┆ f64            ┆ f64                       ┆ f64            │
╞══════════╪═══════════╪═════════════╪════════════════╪═══════════════════════════╪════════════════╡
│ 0.142857 ┆ 0.111111  ┆ 0.090909    ┆ 0.833333       ┆ 0.833333                  ┆ 0.966667       │
│ 0.428571 ┆ 0.333333  ┆ 0.272727    ┆ 0.166667       ┆ 0.0                       ┆ 0.444444       │
│ 0.111111 ┆ 0.111111  ┆ 0.090909    ┆ 0.555556       ┆ 0.444444                  ┆ 0.5            │
│ 0.875    ┆ 0.666667  ┆ 0.545455    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
│ 0.75     ┆ 0.777778  ┆ 0.454545    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
└──────────┴───────────┴─────────────┴────────────────┴───────────────────────────┴────────────────┘
```

And a lot more!
Even in-dataframe nearest neighbors queries! 😲
```python
df.with_columns(
pl.col("id").num.knn_ptwise(
pl.col("val1"), pl.col("val2"),
k = 3, dist = "haversine", parallel = True
).alias("nearest neighbor ids")
)

shape: (5, 6)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────────────────┐
│ id  ┆ val1     ┆ val2     ┆ val3     ┆ val4     ┆ nearest neighbor ids │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ list[u64]            │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════════════════╡
│ 0   ┆ 0.804226 ┆ 0.937055 ┆ 0.401005 ┆ 0.119566 ┆ [0, 3, … 0]          │
│ 1   ┆ 0.526691 ┆ 0.562369 ┆ 0.061444 ┆ 0.520291 ┆ [1, 4, … 4]          │
│ 2   ┆ 0.225055 ┆ 0.080344 ┆ 0.425962 ┆ 0.924262 ┆ [2, 1, … 1]          │
│ 3   ┆ 0.697264 ┆ 0.112253 ┆ 0.666238 ┆ 0.45823  ┆ [3, 1, … 0]          │
│ 4   ┆ 0.227807 ┆ 0.734995 ┆ 0.225657 ┆ 0.668077 ┆ [4, 4, … 0]          │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────────────────┘
```
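
Conceptually, `knn_ptwise` returns, for every row, the ids of its k nearest rows in the feature space; note each point is its own nearest neighbor at distance zero, which is why a row's id leads its own list above. A brute-force Euclidean sketch with a hypothetical helper name (the plugin's version runs in Rust and also supports distances like haversine):

```python
import math

def knn_ptwise(ids, xs, ys, k=3):
    # For each point, sort all points by Euclidean distance and keep
    # the ids of the k closest (the point itself included, at distance 0).
    pts = list(zip(xs, ys))
    out = []
    for p in pts:
        order = sorted(range(len(pts)), key=lambda j: math.dist(p, pts[j]))
        out.append([ids[j] for j in order[:k]])
    return out

neighbors = knn_ptwise([0, 1, 2], [0.0, 0.1, 5.0], [0.0, 0.0, 5.0], k=2)
# → [[0, 1], [1, 0], [2, 1]]
```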

And a lot more!

# Credits

1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See [here](https://github.com/tsoding/seroost)
2. Some statistics functions are taken from Statrs (MIT). See [here](https://github.com/statrs-dev/statrs/tree/master)

# Other related Projects

1. Take a look at our friendly neighbor [functime](https://github.com/TracecatHQ/functime)
2. My other project [dsds](https://github.com/abstractqqq/dsds). This is currently paused because I am developing polars-ds, but some modules in DSDS, such as the diagnosis one, are quite stable.
3. String similarity metrics are so fast and easy to use because of [RapidFuzz](https://github.com/maxbachmann/rapidfuzz-rs)
4 changes: 3 additions & 1 deletion docs/polars_ds.md
@@ -1,3 +1,5 @@
## Additional Expressions

::: polars_ds
::: polars_ds
options:
filters: ["!(NumExt|StatsExt|StrExt|ComplexExt)", "^__init__$"]
3 changes: 2 additions & 1 deletion docs/requirements-docs.txt
@@ -1,4 +1,5 @@
mkdocs==1.5.3
mkdocstrings[python]==0.24.0
mkdocs-material==9.5.2
mkdocs-material==9.5.3
mkdocs-section-index==0.3.8
pytkdocs[numpy-style]==0.16.1
