scikit-RARE

RARE: Relevant Association Rare-variant-bin Evolver (see citation) is an evolutionary algorithm approach to binning rare variants as a rare variant association analysis tool. scikit-RARE is scikit compatible pypi package for the RARE algotithm.

RARE constructs bins of rare variant features with relevant association to class (univariate and/or multivariate interactions) through the following steps:

Random bin initializaiton or expert knowledge input
Repeated evolutionary cycles consisting of:
- Candidate bin evaluation with univariate scoring (chi-square test) or Relief-based scoring (MultiSURF algorithm); note: new scoring options currently under testing
- Genetic operations (parent selection, crossover, and mutation) to generate the next generation of candidate bins
Final bin evaluation and summary of top bins

Installation

We can easily install scikit-rare using the following command:

pip install scikit-rare

Parameters for RARE Class:

given_starting_point: whether or not expert knowledge is being inputted (True or False)
amino_acid_start_point: if RARE is starting with expert knowledge, input the list of features here; otherwise None
amino_acid_bins_start_point: if RARE is starting with expert knowledge, input the list of bins of features here; otherwise None
iterations: the number of evolutionary cycles RARE will run
original_feature_matrix: the dataset
label_name: label for the class/endpoint column in the dataset (e.g., 'Class')
rare_variant_MAF_cutoff: the minor allele frequency cutoff separating common features from rare variant features
set_number_of_bins: the population size of candidate bins
min_features_per_group: the minimum number of features in a bin
max_number_of_groups_with_feature: the maximum number of bins containing a feature
scoring_method: 'Univariate', 'Relief', or 'Relief only on bin and common features'
score_based_on_sample: if Relief scoring is used, whether or not bin evaluation is done based on a sample of instances rather than the whole dataset
score_with_common_variables: if Relief scoring is used, whether or not common features should be used as context for evaluating rare variant bins
instance_sample_size: if bin evaluation is done based on a sample of instances, input the sample size here
crossover_probability: the probability of each feature in an offspring bin to crossover to the paired offspring bin (recommendation: 0.5 to 0.8)
mutation_probability: the probability of each feature in a bin to be deleted (a proportionate probability is automatically applied on each feature outside the bin to be added (recommendation: 0.05 to 0.5 depending on situation and number of iterations run)
elitism_parameter: the proportion of elite bins in the current generation to be preserved for the next evolutionary cycle (recommendation: 0.2 to 0.8 depending on conservativeness of approach and number of iterations run)
random_seed: the seed value needed to generate a random number 19)bin_size_variability_constraint: sets the max bin size of children to be n times the size of their sibling (recommendation: 2, with larger or smaller values the population would trend heavily towards small or large bins without exploring the search space)
max_features_per_bin: sets a max value for the number of features per bin

Parameters for RARE Methods

RARE Variant Data Simulators (RVDSs)

RARE Variant Data Simulators (RVDSs) are functions that create simulated data for testing/evaluating RARE.

The RVDS for Univariate Association Bin (called RVDS_One_Bin) creates a dataset such that no rare variant feature is 100% predictive of class, but an additive bin of features is fully penetrant to class.
The RVDS for Epistatic Interaction Bin creates a dataset such that no rare variant feature or bin of rare variant features is predictive of class, but an epistatic interaction between a common feature and an additive bin of rare variant features is 100% predictive of class.

Parameters for RVDS for Univariate Association Bin:

number_of_instances: number of instances (i.e., rows) desired in the simulated dataset
number_of_features: the total number of rare variant features that should be in the simulated dataset
number_of_features_in_bin: of the number_of_features, how many rare variant features should be binned additively for univariate association to class
rare_variant_MAF_cutoff: the minor allele frequency that all rare variant features but be below
endpoint_cutoff_parameter: "mean" or "median" (recommended "mean")
endpoint_variation_probability: how much noise is desired in the dataset (0 produces a bin with a 100% clear signal, 0.5 can be used as a negative control where class value is randomly assigned)

Parameters for RVDS for Epistatic Interaction Bin:

number_of_instances: number of instances (i.e., rows) desired in the simulated dataset
number_of_rare_features: the total number of rare variant features that should be in the simulated dataset
number_of_features_in_bin: of the number_of_features, how many rare variant features should be binned additively for univariate association to class
rare_variant_MAF_cutoff: the minor allele frequency that all rare variant features but be below
common_feature_genotype_frequencies_list: a list with the genotype frequencies of each of the common feature genotypes (BB, Bb, bb). Should be of the form [0.25, 0.5, 0.25], where 0.25 is the frequency of the BB genotype, 0.5 is the frequency of the Bb genotype, and 0.25 is the frequency of the bb genotype. [0.25, 0.5, 0.25] is recommended
genotype_cutoff_metric: "mean" or "median" (recommended "mean")
endpoint_variation_probability: how much noise is desired in the dataset (0 produces a bin that interacts with a common feature to be fully penetrant, 0.5 can be used as a negative control where class value is randomly assigned)
list of MLGs_predicting_disease: which of the nine MLGs (AABB, AaBB, aaBB, AABb, AaBb, aaBb, AAbb, Aabb, aabb) correspond to a value of 1 in the class column. [AABB, aaBB, AaBb, AAbb, aabb] should be paired with [0.25, 0.5, 0.25] for the common feature genotype frequencies list to create a dataset with pure, strict epistasis
print_summary: whether or not a summary of the simulated datasets with penetrance and frequency values for each of the bin genotypes, common feature genotypes, and MLGs should be printed (True or False)

Citation

Dasariraju, S., & Urbanowicz, R. J. (2021). RARE: Evolutionary feature engineering for rare-variant bin discovery. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1335–1343. https://doi.org/10.1145/3449726.3463174

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.github/workflows		.github/workflows
data		data
docs		docs
src/skrare		src/skrare
tests		tests
.gitignore		.gitignore
.pipyrc		.pipyrc
LICENSE		LICENSE
RARE-JupyterNotebook.ipynb		RARE-JupyterNotebook.ipynb
README.md		README.md
project.toml		project.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scikit-RARE

Installation

Parameters for RARE Class:

Parameters for RARE Methods

RARE Variant Data Simulators (RVDSs)

Parameters for RVDS for Univariate Association Bin:

Parameters for RVDS for Epistatic Interaction Bin:

Citation

About

Releases 1

Packages

Languages

License

UrbsLab/scikit-RARE

Folders and files

Latest commit

History

Repository files navigation

scikit-RARE

Installation

Parameters for RARE Class:

Parameters for RARE Methods

RARE Variant Data Simulators (RVDSs)

Parameters for RVDS for Univariate Association Bin:

Parameters for RVDS for Epistatic Interaction Bin:

Citation

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages