RARE

Citation: Dasariraju, S., & Urbanowicz, R. J. (2021). RARE: Evolutionary feature engineering for rare-variant bin discovery. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1335–1343. https://doi.org/10.1145/3449726.3463174

RARE: Relevant Association Rare-variant-bin Evolver (under development) is an evolutionary algorithm approach to binning rare variants as a rare variant association analysis tool. Please email [email protected] and [email protected] for any inquiries related to RARE.

RARE is an evolutionary algorithm that constructs bins of rare variant features with relevant association to class (univariate and/or multivariate interactions) through the following steps:

Random bin initializaiton or expert knowledge input
Repeated evolutionary cycles consisting of:
- Candidate bin evaluation with univariate scoring (chi-square test) or Relief-based scoring (MultiSURF algorithm); note: new scoring options currently under testing
- Genetic operations (parent selection, crossover, and mutation) to generate the next generation of candidate bins
Final bin evaluation and summary of top bins

In the GECCO '21 folder, please see the RARE_Methods.py file for code definition of the RARE function and its subfunctions. The RAREConstantBinSizeFunctionsDefinition.py file contains code for a modified version of RARE that preserves a constant bin size through initilaization and evolutionary cycles (these files also contain code defining the RVDS functions for the data simulators used to test RARE)

Parameters for RARE:

given_starting_point: whether or not expert knowledge is being inputted (True or False)
amino_acid_start_point: if RARE is starting with expert knowledge, input the list of features here; otherwise None
amino_acid_bins_start_point: if RARE is starting with expert knowledge, input the list of bins of features here; otherwise None
iterations: the number of evolutionary cycles RARE will run
original_feature_matrix: the dataset
label_name: label for the class/endpoint column in the dataset (e.g., 'Class')
rare_variant_MAF_cutoff: the minor allele frequency cutoff separating common features from rare variant features
set_number_of_bins: the population size of candidate bins
min_features_per_group: the minimum number of features in a bin
max_number_of_groups_with_feature: the maximum number of bins containing a feature
scoring_method: 'Univariate', 'Relief', or 'Relief only on bin and common features'
score_based_on_sample: if Relief scoring is used, whether or not bin evaluation is done based on a sample of instances rather than the whole dataset
score_with_common_variables: if Relief scoring is used, whether or not common features should be used as context for evaluating rare variant bins
instance_sample_size: if bin evaluation is done based on a sample of instances, input the sample size here
crossover_probability: the probability of each feature in an offspring bin to crossover to the paired offspring bin (recommendation: 0.5 to 0.8)
mutation_probability: the probability of each feature in a bin to be deleted (a proportionate probability is automatically applied on each feature outside the bin to be added (recommendation: 0.05 to 0.5 depending on situation and number of iterations run)
elitism_parameter: the proportion of elite bins in the current generation to be preserved for the next evolutionary cycle (recommendation: 0.2 to 0.8 depending on conservativeness of approach and number of iterations run)

RARE Variant Data Simulators (RVDSs) are functions that create simulated data for testing/evaluating RARE.

The RVDS for Univariate Association Bin (called RVDS_One_Bin) creates a dataset such that no rare variant feature is 100% predictive of class, but an additive bin of features is fully penetrant to class.
The RVDS for Epistatic Interaction Bin creates a dataset such that no rare variant feature or bin of rare variant features is predictive of class, but an epistatic interaction between a common feature and an additive bin of rare variant features is 100% predictive of class.

In the GECCO '21 folder, please see the RARE_Variant_Data_Simulator_Methods.py file for the code of the two RVDSs.

Parameters for RVDS for Univariate Association Bin:

number_of_instances: number of instances (i.e., rows) desired in the simulated dataset
number_of_features: the total number of rare variant features that should be in the simulated dataset
number_of_features_in_bin: of the number_of_features, how many rare variant features should be binned additively for univariate association to class
rare_variant_MAF_cutoff: the minor allele frequency that all rare variant features but be below
endpoint_cutoff_parameter: "mean" or "median" (recommended "mean")
endpoint_variation_probability: how much noise is desired in the dataset (0 produces a bin with a 100% clear signal, 0.5 can be used as a negative control where class value is randomly assigned)

Parameters for RVDS for Epistatic Interaction Bin:

number_of_instances: number of instances (i.e., rows) desired in the simulated dataset
number_of_rare_features: the total number of rare variant features that should be in the simulated dataset
number_of_features_in_bin: of the number_of_features, how many rare variant features should be binned additively for univariate association to class
rare_variant_MAF_cutoff: the minor allele frequency that all rare variant features but be below
common_feature_genotype_frequencies_list: a list with the genotype frequencies of each of the common feature genotypes (BB, Bb, bb). Should be of the form [0.25, 0.5, 0.25], where 0.25 is the frequency of the BB genotype, 0.5 is the frequency of the Bb genotype, and 0.25 is the frequency of the bb genotype. [0.25, 0.5, 0.25] is recommended
genotype_cutoff_metric: "mean" or "median" (recommended "mean")
endpoint_variation_probability: how much noise is desired in the dataset (0 produces a bin that interacts with a common feature to be fully penetrant, 0.5 can be used as a negative control where class value is randomly assigned)
list of MLGs_predicting_disease: which of the nine MLGs (AABB, AaBB, aaBB, AABb, AaBb, aaBb, AAbb, Aabb, aabb) correspond to a value of 1 in the class column. [AABB, aaBB, AaBb, AAbb, aabb] should be paired with [0.25, 0.5, 0.25] for the common feature genotype frequencies list to create a dataset with pure, strict epistasis
print_summary: whether or not a summary of the simulated datasets with penetrance and frequency values for each of the bin genotypes, common feature genotypes, and MLGs should be printed (True or False)

We evaluate RARE with 9 Experiments contained in the RAREExperiments file, located in the GECCO '21 Folder. Each file contains an example of using an RVDS to create a simulated dataset and also shows how to apply the RARE algorithm on a dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 97 Commits
Development		Development
GECCO '21		GECCO '21
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RARE

About

Packages

Contributors 4

Languages

License

UrbsLab/RARE

Folders and files

Latest commit

History

Repository files navigation

RARE

About

Resources

License

Stars

Watchers

Forks

Packages 0

Contributors 4

Languages

Packages