diff --git a/README.md b/README.md
index f8d3e5c..a8fa632 100644
--- a/README.md
+++ b/README.md
@@ -1,41 +1,43 @@
-[![Python application](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml)
-[![Python Package using Conda](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml)
+[![Python application](https://github.com/enriquea/fslite/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/enriquea/fslite/actions/workflows/python-app.yml)
+[![Python Package using Conda](https://github.com/enriquea/fslite/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/enriquea/fslite/actions/workflows/python-package-conda.yml)

-# fsspark
+# fslite

 ---

-## Feature selection in Spark
+### Memory-Efficient, High-Performance Feature Selection Library for Big and Small Datasets

 ### Description

-`fsspark` is a python module to perform feature selection and machine learning based on spark.
-Pipelines written using `fsspark` can be divided roughly in four major stages: 1) data pre-processing, 2) univariate
+`fslite` is a python module to perform feature selection and machine learning using pre-built FS pipelines.
+Pipelines written using `fslite` can be divided roughly in four major stages: 1) data pre-processing, 2) univariate
 filters, 3) multivariate filters and 4) machine learning wrapped with cross-validation (**Figure 1**).

+`fslite` is based on our previous work [feseR](https://github.com/enriquea/feseR); previously implemented in R and caret package; publication can be found [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189875).
+
 ![Feature Selection flowchart](images/fs_workflow.png)

-**Figure 1**. Feature selection workflow example implemented in fsspark.
+**Figure 1**. Feature selection workflow example implemented in fslite.

 ### Documentation

 The package documentation describes the [data structures](docs/README.data.md) and
-[features selection methods](docs/README.methods.md) implemented in `fsspark`.
+[features selection methods](docs/README.methods.md) implemented in `fslite`.

 ### Installation

 - pip
 ```bash
-git clone https://github.com/enriquea/fsspark.git
-cd fsspark
+git clone https://github.com/bigbio/fslite.git
+cd fslite
 pip install . -r requirements.txt
 ```

 - conda
 ```bash
-git clone https://github.com/enriquea/fsspark.git
-cd fsspark
+git clone https://github.com/bigbio/fslite.git
+cd fslite
 conda env create -f environment.yml
-conda activate fsspark-venv
+conda activate fslite-venv
 pip install . -r requirements.txt
 ```
diff --git a/fsspark/tests/test_fsdataframe.py b/fsspark/tests/test_fsdataframe.py
index 7b01311..b1ab9e9 100644
--- a/fsspark/tests/test_fsdataframe.py
+++ b/fsspark/tests/test_fsdataframe.py
@@ -107,4 +107,7 @@ def measure_memory_usage(n_samples: int, n_features: int, nan_prob = 0.01) -> fl
     plt.show()

     # Print results table
-    print(results_df.to_string(index=False))
\ No newline at end of file
+    print(results_df.to_string(index=False))
+
+    # Initialize FSDataFrame with DataFrame having sparse numerical features and insufficient memory for dense matrix
+