Refactor py #12

Open
wants to merge 64 commits into
base: dev
82e5c3b
first line
ypriverol Sep 19, 2024
deb2df6
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
70fec44
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
b99aee0
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
64fb1aa
update fdataframe class
enriquea Sep 20, 2024
cc471f0
minor refactory
enriquea Sep 20, 2024
30e0659
Merge branch 'refactor-py' of https://github.com/bigbio/fsspark into …
enriquea Sep 20, 2024
471dafa
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
66d6118
Merge remote-tracking branch 'origin/refactor-py' into refactor-py
ypriverol Sep 20, 2024
174196a
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
fa0d320
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
0a8080b
added test univariate corr
enriquea Sep 20, 2024
8558656
refactor univariate methods (corr)
enriquea Sep 20, 2024
d2ca24d
update
enriquea Sep 20, 2024
516b4c6
added methods to select features and update FSDataFrame
enriquea Sep 20, 2024
a787707
move from unitests to pytests
ypriverol Sep 20, 2024
f75093d
move from unitests to pytests
ypriverol Sep 20, 2024
f15b4e8
minor changes to store sparse matrices
ypriverol Sep 21, 2024
ea15b18
fsspark -> fslite
ypriverol Sep 22, 2024
a4de03c
fsspark -> fslite
ypriverol Sep 22, 2024
032a422
better structure for methods in constants.py
ypriverol Sep 22, 2024
c2312c8
better structure for methods in constants.py
ypriverol Sep 22, 2024
10ee2e8
fsspark -> fslite
ypriverol Sep 22, 2024
a69ac12
Minor changes in constants.py
ypriverol Sep 22, 2024
3f56ded
black applied
ypriverol Sep 22, 2024
1fafeb5
clean more code.
ypriverol Sep 22, 2024
f2ce664
clean more code.
ypriverol Sep 22, 2024
6d1f54a
update in dependencies
ypriverol Sep 22, 2024
a0181aa
update in dependencies
ypriverol Sep 22, 2024
4a93621
update in dependencies
ypriverol Sep 22, 2024
5d70dfc
update in dependencies
ypriverol Sep 22, 2024
0eddddd
update in dependencies
ypriverol Sep 22, 2024
94703eb
smaller tests for CI/CD
ypriverol Sep 22, 2024
f67a259
smaller tests for CI/CD
ypriverol Sep 23, 2024
7a08e82
Another refactoring
ypriverol Sep 23, 2024
5e56b21
Another refactoring
ypriverol Sep 23, 2024
9b74ada
Another refactoring
ypriverol Sep 23, 2024
b1c4ad5
refactoring ml methods
ypriverol Sep 23, 2024
c657be9
refactoring ml methods
ypriverol Sep 23, 2024
c46167c
added file for experiments
ypriverol Sep 23, 2024
7b06d1e
minor comments
ypriverol Sep 23, 2024
35f58a2
minor refinements
ypriverol Sep 23, 2024
43dddb7
minor refinements
ypriverol Sep 23, 2024
b6e8eab
added example script to parse single-cell data
enriquea Sep 23, 2024
07a9dc5
implemented univariate selector methods (from sci-learn) and added te…
enriquea Sep 23, 2024
6c29cd8
added implementation for multivariate methods: variance and matrix_co…
enriquea Sep 24, 2024
5cbd7da
added tests for multivariate
enriquea Sep 24, 2024
cc493f6
loom2parquet examples
ypriverol Sep 24, 2024
4250a4e
Update fslite/fs/utils.py
ypriverol Sep 25, 2024
cc4e794
Update fslite/tests/test_ml_methods.py
ypriverol Sep 25, 2024
0ccd98d
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
0e24e2c
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
82a1a86
delete ML methods
ypriverol Sep 25, 2024
e2f7b9c
delete ML methods
ypriverol Sep 25, 2024
5a91f14
Update examples/loom2parquetmerge.py
ypriverol Sep 25, 2024
718b743
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
681a823
Update fslite/fs/methods.py
ypriverol Sep 25, 2024
8608117
Update fslite/fs/ml.py
ypriverol Sep 25, 2024
6ecbaca
Update fslite/fs/fdataframe.py
ypriverol Sep 25, 2024
d5cc974
Update fslite/fs/multivariate.py
ypriverol Sep 25, 2024
7ee27c8
delete ML methods
ypriverol Sep 25, 2024
d1f74d6
delete ML methods
ypriverol Sep 25, 2024
3909487
refactoring parquet SC generation
ypriverol Sep 25, 2024
07cb771
small changes
ypriverol Sep 26, 2024
28 changes: 15 additions & 13 deletions README.md
@@ -1,41 +1,43 @@
[![Python application](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml)
[![Python Package using Conda](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml)
[![Python application](https://github.com/bigbio/fslite/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/enriquea/fslite/actions/workflows/python-app.yml)
[![Python Package using Conda](https://github.com/bigbio/fslite/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/bigbio/fslite/actions/workflows/python-package-conda.yml)

# fsspark
# fslite

---

## Feature selection in Spark
### Memory-Efficient, High-Performance Feature Selection Library for Big and Small Datasets

### Description

`fsspark` is a python module to perform feature selection and machine learning based on spark.
Pipelines written using `fsspark` can be divided roughly in four major stages: 1) data pre-processing, 2) univariate
`fslite` is a python module to perform feature selection and machine learning using pre-built FS pipelines.
Pipelines written using `fslite` can be divided roughly in four major stages: 1) data pre-processing, 2) univariate
filters, 3) multivariate filters and 4) machine learning wrapped with cross-validation (**Figure 1**).

`fslite` is based on our previous work [feseR](https://github.com/enriquea/feseR); previously implemented in R and caret package; publication can be found [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189875).

![Feature Selection flowchart](images/fs_workflow.png)
**Figure 1**. Feature selection workflow example implemented in fsspark.
**Figure 1**. Feature selection workflow example implemented in fslite.

### Documentation

The package documentation describes the [data structures](docs/README.data.md) and
[features selection methods](docs/README.methods.md) implemented in `fsspark`.
[features selection methods](docs/README.methods.md) implemented in `fslite`.

### Installation

- pip
```bash
git clone https://github.com/enriquea/fsspark.git
cd fsspark
git clone https://github.com/bigbio/fslite.git
cd fslite
pip install . -r requirements.txt
```

- conda
```bash
git clone https://github.com/enriquea/fsspark.git
cd fsspark
git clone https://github.com/bigbio/fslite.git
cd fslite
conda env create -f environment.yml
conda activate fsspark-venv
conda activate fslite-venv
pip install . -r requirements.txt
```

4 changes: 4 additions & 0 deletions docs/EXPERIMENTS.md
@@ -0,0 +1,4 @@
## Experiments and Benchmarks

This document contains the experiments and benchmarks that were conducted to evaluate the performance of fslite.
The experiments were conducted on the following datasets:
32 changes: 17 additions & 15 deletions docs/README.data.md
@@ -1,9 +1,9 @@
## fsspark - data structures
## fslite - data structures

---

`fsspark` is a Python package that provides a set of tools for feature selection in Spark.
Here we describe the main data structures used in `fsspark` and how to use them.
`fslite` is a Python package that provides a set of tools for feature selection in Spark.
Here we describe the main data structures used in `fslite` and how to use them.

### Input data

@@ -32,30 +32,32 @@ The following is an example of a TSV file with a binary response variable:

### Import functions

`fsspark` provides two main functions to import data from a TSV file.
`fslite` provides two main functions to import data from a TSV file.

- `import_table` - Import data from a TSV file into a Spark Data Frame (sdf).

```python
from fsspark.utils.io import import_table
sdf = import_table('data.tsv.bgz',
sep='\t',
n_partitions=5)
from fslite.utils.io import import_table

sdf = import_table('data.tsv.bgz',
sep='\t',
n_partitions=5)
```

- `import_table_as_psdf` - Import data from a TSV file into a Spark Data Frame (sdf) and
convert it into a Pandas on Spark Data Frame (psdf).

```python
from fsspark.utils.io import import_table_as_psdf
psdf = import_table_as_psdf('data.tsv.bgz',
sep='\t',
from fslite.utils.io import import_table_as_psdf

psdf = import_table_as_psdf('data.tsv.bgz',
sep='\t',
n_partitions=5)
```
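Since this PR moves the package away from Spark, a rough pandas analogue of the import step may help reviewers picture the target behaviour. This is only a sketch under stated assumptions: `read_tsv` is a hypothetical helper, not a function from `fslite`, and the column names in the toy table are invented for illustration.

```python
import io

import pandas as pd

# Toy TSV: one sample column, one label column, two feature columns.
# All column names are illustrative assumptions.
TSV_TEXT = (
    "sample_id\tlabel\tfeature_1\tfeature_2\n"
    "s1\t0\t0.5\t1.2\n"
    "s2\t1\t0.7\t0.9\n"
    "s3\t0\t0.1\t1.5\n"
)


def read_tsv(source, sep="\t"):
    """Hypothetical helper: read a delimited table into a pandas DataFrame.

    `source` may be a file path or a file-like object.
    """
    return pd.read_csv(source, sep=sep)


df = read_tsv(io.StringIO(TSV_TEXT))
print(df.shape)
```

For a compressed file, `pd.read_csv` also accepts an explicit `compression` argument.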

### The Feature Selection Spark Data Frame (FSDataFrame)

The `FSDataFrame` (**Figure 1**) is a core functionality of `fsspark`. It is a wrapper around a Spark Data Frame (sdf)
The `FSDataFrame` (**Figure 1**) is a core functionality of `fslite`. It is a wrapper around a Spark Data Frame (sdf)
that provides a set of methods to facilitate feature selection tasks. The `FSDataFrame` is initialized
with a Spark Data Frame (sdf) or a Pandas on Spark Data Frame (psdf) and two mandatory arguments:
`sample_col` and `label_col`. The `sample_col` argument is the name of the column in the sdf that
@@ -73,9 +75,9 @@ contains the response variable.
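To make the role of `sample_col` and `label_col` concrete, here is a minimal sketch of a wrapper around a pandas DataFrame. `MiniFSDataFrame` and its methods are hypothetical names chosen for illustration; they are not the `FSDataFrame` API from this PR, whose exact signature is not visible in this diff.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class MiniFSDataFrame:
    """Hypothetical stand-in for an FSDataFrame-like wrapper (not the fslite API)."""

    df: pd.DataFrame
    sample_col: str
    label_col: str

    def __post_init__(self):
        # Both mandatory columns must be present in the wrapped table.
        missing = {self.sample_col, self.label_col} - set(self.df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {sorted(missing)}")

    @property
    def feature_names(self):
        # Every column that is neither the sample id nor the label is a feature.
        reserved = (self.sample_col, self.label_col)
        return [c for c in self.df.columns if c not in reserved]

    def select_features(self, names):
        # Return a new wrapper that keeps only the requested feature columns.
        keep = [self.sample_col, self.label_col] + list(names)
        return MiniFSDataFrame(self.df[keep], self.sample_col, self.label_col)


table = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "label": [0, 1, 0],
        "f1": [0.1, 0.4, 0.3],
        "f2": [1.0, 2.0, 3.0],
    }
)
fsdf = MiniFSDataFrame(table, sample_col="sample_id", label_col="label")
reduced = fsdf.select_features(["f2"])
```

Keeping the sample and label columns separate from the features means a filter can return a list of feature names and the wrapper can rebuild a reduced table without losing the identifiers.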
#### How to create a Feature Selection Spark Data Frame (FSDF)

```python
from fsspark.config.context import init_spark, stop_spark_session
from fsspark.fs.core import FSDataFrame
from fsspark.utils.io import import_table_as_psdf
from fslite.config.context import init_spark, stop_spark_session
from fslite.fs.core import FSDataFrame
from fslite.utils.io import import_table_as_psdf

# Init spark
init_spark()
8 changes: 4 additions & 4 deletions docs/README.methods.md
@@ -1,10 +1,10 @@

# fsspark - features selection methods
# fslite - features selection methods

---

`fsspark `includes a set of methods to perform feature selection and machine learning based on spark.
A typical workflow written using `fsspark` can be divided roughly in four major stages:
`fslite `includes a set of methods to perform feature selection and machine learning based on spark.
A typical workflow written using `fslite` can be divided roughly in four major stages:

1) data pre-processing.
2) univariate filters.
@@ -53,4 +53,4 @@ A typical workflow written using `fsspark` can be divided roughly in four major

### 5. Feature selection pipeline example

[FS pipeline example](../fsspark/pipeline/fs_pipeline_example.py)
[FS pipeline example](../fslite/pipeline/fs_pipeline_example.py)
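The univariate and multivariate filter stages named in the workflow can be sketched with plain pandas. The two functions below are hypothetical illustrations of a correlation-based univariate filter and a greedy pairwise-correlation multivariate filter; their names, signatures and thresholds are assumptions and do not reproduce the methods added in this PR.

```python
import pandas as pd


def univariate_corr_filter(features, label, threshold):
    """Keep features whose absolute Pearson correlation with the label exceeds threshold."""
    corr = features.corrwith(label).abs()
    return corr[corr > threshold].index.tolist()


def multivariate_corr_filter(features, threshold):
    """Greedily drop any feature that is highly correlated with one already kept."""
    corr = features.corr().abs()
    kept = []
    for name in features.columns:
        if all(corr.loc[name, other] <= threshold for other in kept):
            kept.append(name)
    return kept


# Toy data: "a" tracks the label, "b" is an exact multiple of "a",
# and "c" has the same mean in both label groups.
label = pd.Series([0, 0, 0, 1, 1, 1])
features = pd.DataFrame(
    {
        "a": [1, 2, 1, 5, 6, 5],
        "b": [2, 4, 2, 10, 12, 10],
        "c": [3, 1, 2, 3, 1, 2],
    }
)

selected = univariate_corr_filter(features, label, threshold=0.5)
final = multivariate_corr_filter(features[selected], threshold=0.9)
```

Here the univariate step discards `c`, and the multivariate step then removes `b` because it is redundant with `a`.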
19 changes: 12 additions & 7 deletions environment.yml
@@ -1,14 +1,19 @@
name: fsspark-venv
name: fslite-venv
channels:
- defaults
- conda-forge
dependencies:
- python==3.10
- pip
- pip:
- setuptools~=65.5.0
- pyspark~=3.3.0
- networkx~=2.8.7
- numpy~=1.23.4
- pandas~=1.5.1
- pyarrow~=8.0.0
- setuptools
- networkx
- numpy
- pyarrow
- pandas
- scipy
- scikit-learn
- psutil
- pytest
- matplotlib
- memory-profiler
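Because the new dependency list is unpinned, a quick check that every package is importable in the created environment can catch resolution problems early. The snippet below is a sketch, not part of the PR; `missing_modules` is a hypothetical helper, and the module names are the import names assumed to correspond to the packages listed above.

```python
from importlib.util import find_spec

# Import names for the packages listed in environment.yml.
# Some distributions install under a different module name
# (scikit-learn -> sklearn, memory-profiler -> memory_profiler).
REQUIRED_MODULES = [
    "setuptools",
    "networkx",
    "numpy",
    "pyarrow",
    "pandas",
    "scipy",
    "sklearn",
    "psutil",
    "pytest",
    "matplotlib",
    "memory_profiler",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("missing:", missing or "none")
```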
136 changes: 136 additions & 0 deletions examples/loom2parquetchunks.py
@@ -0,0 +1,136 @@
# Import and convert to parquet a single-cell dataset: GSE156793 (loom format)
# GEO URL:
# https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE156793&format=file&file=GSE156793%5FS3%5Fgene%5Fcount%2Eloom%2Egz

# import libraries
import pandas as pd
import loompy
import pyarrow.parquet as pq
import pyarrow as pa

# define the path to the loom file
loom_file = "GSE156793_S3_gene_count.loom"

# connect to the loom file
ds = loompy.connect(loom_file)

# get shape of the data
ds.shape

# retrieve the row attributes
ds.ra.keys()

# get gene ids
gene_ids = ds.ra["gene_id"]
gene_ids[0:10]

# get the column attributes
ds.ca.keys()

# get sample metadata
sample_id = ds.ca["sample"]
cell_cluster = ds.ca["Main_cluster_name"]
assay = ds.ca["Assay"]
development_day = ds.ca["Development_day"]

# make a dataframe with the sample metadata, define the columns types
sample_df = pd.DataFrame(
    {
        "sample_id": sample_id,
        "cell_cluster": cell_cluster,
        "assay": assay,
        "development_day": development_day,
    }
)

# print the first 5 rows
sample_df.head()

# Make 'cell_cluster' a categorical variable encoded as an integer
sample_df["cell_cluster"] = sample_df["cell_cluster"].astype("category")
sample_df["cell_cluster_id"] = sample_df["cell_cluster"].cat.codes

# Make 'assay' a categorical variable encoded as an integer
sample_df["assay"] = sample_df["assay"].astype("category")
sample_df["assay_id"] = sample_df["assay"].cat.codes

# Make 'sample_id' the index
sample_df = sample_df.set_index("sample_id")

# Show the first 5 rows
sample_df.head()

# Save the sample metadata to parquet
(
    sample_df.reset_index().to_parquet(
        "sample_metadata.parquet", index=False, engine="auto", compression="gzip"
    )
)


# transpose dataset and convert to parquet.
# process the data per chunks.
chunk_size = 10000
writer = None
count = 0
number_chunks = 10 # number of chunks to process

for ix, selection, view in ds.scan(axis=1, batch_size=chunk_size):
    # retrieve the chunk
    matrix_chunk = view[:, :]

    # transpose the data
    matrix_chunk_t = matrix_chunk.T

    # convert to pandas dataframe
    df_chunk = pd.DataFrame(
        matrix_chunk_t, index=sample_id[selection.tolist()], columns=gene_ids
    )

    # merge chunk with sample metadata
    df_chunk = pd.merge(
        left=sample_df[["cell_cluster_id", "development_day", "assay_id"]],
        right=df_chunk,
        how="inner",
        left_index=True,
        right_index=True,
        sort=False,
        copy=True,
        indicator=False,
        validate="one_to_one",
    )

    # reset the index
    df_chunk = df_chunk.reset_index()

    # rename the index column
    df_chunk = df_chunk.rename(columns={"index": "sample_id"})

    if writer is None:
        # define the schema
        schema = pa.schema(
            [
                pa.field("sample_id", pa.string()),
                pa.field("cell_cluster_id", pa.int8()),
                pa.field("development_day", pa.int64()),
                pa.field("assay_id", pa.int8()),
            ]
            + [pa.field(gene_id, pa.float32()) for gene_id in gene_ids]
        )

        print(len(list(df_chunk.columns)))
        print(len(schema))

        # create the parquet writer
        writer = pq.ParquetWriter("GSE156793.parquet", schema, compression="snappy")

    writer.write_table(pa.Table.from_pandas(df_chunk, preserve_index=False))

    print(f"Chunk {ix} saved")

    count += 1
    if count >= number_chunks:
        break

if writer is not None:
    writer.close()
print(f"Concatenated parquet file written to GSE156793.parquet")
File renamed without changes.
File renamed without changes.