doc: Results parallel (#77)
* google analytics setup

* doc: parallel results
mwiewior authored Jan 18, 2025
1 parent 8d0d8be commit 87c8892
Showing 5 changed files with 26 additions and 9 deletions.
9 changes: 8 additions & 1 deletion README.md
@@ -13,9 +13,16 @@
[polars-bio](https://pypi.org/project/polars-bio/) is a Python library for genomics built on top of [polars](https://pola.rs/), [Apache Arrow](https://arrow.apache.org/) and [Apache DataFusion](https://datafusion.apache.org/).
It provides a DataFrame API for genomics data and is designed to be blazing fast, memory efficient and easy to use.


## Single-thread performance 🏃‍
![overlap-single.png](docs/assets/overlap-single.png)

![nearest-single.png](docs/assets/nearest-single.png)
![overlap-single.png](docs/assets/nearest-single.png)

## Parallel performance 🏃‍🏃‍
![overlap-parallel.png](docs/assets/overlap-parallel.png)

![overlap-parallel.png](docs/assets/nearest-parallel.png)
## Key Features
* optimized for [performance](https://biodatageeks.org/polars-bio/performance/) and large-scale genomics datasets
* popular genomics [operations](https://biodatageeks.org/polars-bio/features/#genomic-ranges-operations) with a DataFrame API (both [Pandas](https://pandas.pydata.org/) and [polars](https://pola.rs/))
Binary file added docs/assets/nearest-parallel.png
Binary file added docs/assets/overlap-parallel.png
11 changes: 11 additions & 0 deletions docs/index.md
@@ -7,6 +7,17 @@ polars-bio is a :rocket:blazing [fast](performance.md#results-summary-) Python D
and [polars](https://pola.rs/).
It is designed to be easy to use, fast and memory efficient with a focus on genomics data.

## Single-thread performance 🏃‍
![overlap-single.png](assets/overlap-single.png)

![overlap-single.png](assets/nearest-single.png)

## Parallel performance 🏃‍🏃‍
![overlap-parallel.png](assets/overlap-parallel.png)

![overlap-parallel.png](assets/nearest-parallel.png)


## Key Features
* optimized for [performance](performance.md#results-summary-) and large-scale genomics datasets
* popular genomics [operations](features.md#genomic-ranges-operations) with a DataFrame API (both [Pandas](https://pandas.pydata.org/) and [polars](https://pola.rs/))
15 changes: 7 additions & 8 deletions docs/performance.md
@@ -1,15 +1,14 @@
# Results summary 📈


!!! todo
- Add summary of the results

## Single-threaded performance 🏃‍
## Single-thread performance 🏃‍
![overlap-single.png](assets/overlap-single.png)

![overlap-single.png](assets/nearest-single.png)

## Parallel performance 🏃‍🏃‍🏃‍
## Parallel performance 🏃‍🏃‍
![overlap-parallel.png](assets/overlap-parallel.png)

![overlap-parallel.png](assets/nearest-parallel.png)
## Benchmarks 🧪
### Detailed results shortcuts 👨‍🔬
- [Binary operations](#binary-operations)
@@ -35,7 +34,7 @@
!!! note
Test dataset in *Parquet* format can be downloaded from:

* for [single-threaded](https://drive.google.com/file/d/1lctmude31mSAh9fWjI60K1bDrbeDPGfm/view?usp=sharing) tests
* for [single-thread](https://drive.google.com/file/d/1lctmude31mSAh9fWjI60K1bDrbeDPGfm/view?usp=sharing) tests
* for [parallel](https://drive.google.com/file/d/1Sj7nTB5gCUq9nbeQOg4zzS4tKO37M5Nd/view?usp=sharing) tests (8 partitions per dataset)

### Test libraries 📚
@@ -720,7 +719,7 @@ pb.ctx.set_option("datafusion.optimizer.repartition_joins", "true")
pb.ctx.set_option("datafusion.optimizer.repartition_file_scans", "true")
pb.ctx.set_option("datafusion.execution.coalesce_batches", "false")
```
the `single-threaded` dataset was used (see [Test datasets](#test-datasets))
the `single-thread` dataset was used (see [Test datasets](#test-datasets))


- `polars_bio-n-p`: Custom partitioning schema (constant number of 8 partitions/dataset) without any repartitioning in DataFusion: