'Gene Set Meta path Analysis for Pattern Regression', MAPR, is a tools that addresses the bioinformatics task to identify the genes most related to a given gene set using heterogeneous networks of gene-gene relationships. It relies on a supervised learning algorithm that for the first time explicitly considers the patterns of direct as well as indirect relationships present in the network to quantify gene-gene similarities. More information about the method and its evaluations can be found at the manuscript at bioRxiv :
Gene Sets Analysis using Network Patterns.
Gregory Linkowski, Charles Blatti, Krishna Kalari, Saurabh Sinha, Shobha Vasudevan
bioRxiv 629816; doi: https://doi.org/10.1101/629816
Questions or feedback should be directed to Charles Blatti ([email protected]).
Through its multi-step process, the MAPR tool takes in a description of a heterogeneous network of gene-gene relations and one (or more) input gene sets, and returns a ranking of all genes in the network by their meta-path based similarity to the input set. The three primary steps of the algorithm are as follows:
- Prepare the Heterogeneous Network: given an
edge_file
containing raw network edges, quantify meta paths through heterogeneous network - Create Meta-Path Features: given
input_sets
of gene sets, use cross-validation and convert meta-paths into gene-based features for future learning - Learning Ensemble to Rank Genes: From meta path features, train an ensemble of LASSO-based classifiers to rank genes for membership to input set.
The major parameters of the method are the types of meta-paths that are to be considered (keep_types
), the lengths of those meta paths (max_metapath_length
), the number of folds in cross-validation (cv_folds
), and the number of classifiers in the ensemble predictor (num_models
).
Below is a diagram from the paper that shows a schematic for each of the three primary steps:
The final version of GeneSet MAPR was written and tested in Python 3.5. Certain functionality is known to not work with version 2.7 and earlier.
The only non-standard python packages that are required are NumPy and scikit-Learn. Exact package numbers are listed in the requirements.txt
file.
In this section, we will walk through a quick example with demo data. In this example we will use a heterogeneous network with many edge types. We will calculate all meta-paths of length 1, 2, and 3. We will then convert these meta-path statistics into gene features for two different input gene sets. Finally, we will use those features to train an ensemble of classifiers to return a ranked list of related genes for each of our input sets.
The first step of network processing is handled by the script: MAPR_networkPrep.py
.
This script requires the network_prefix
, the path (relative or fixed) to the two network files provided by the user, 1) the .edge.txt
file that contains the edges of the heterogeneous network and 2) the .keep.txt
file that describes the different network node and edge types. Information about these file formats are found in the MAPR Resources input formats section.
The MAPR_networkPrep.py
script will create the meta-path matrices and other supporting files, such as lists of all unique genes or non-gene terms in the network. By default this will be created in the same directory as the network location, but can be changed with an optional flag, -n
.
Optional arguments are specified with flags, and include:
-k
An alternative keep file location. Otherwise assumed to be in same directory as the edge file, named with the same name (ie:network_prefix
.keep.txt).-l
The maximum length of meta-paths to calculate. Currently, the script will calculate paths up to lengths of 3 edges, but this can be set to 2 or 1.-n
An alternative output location for the processed network. By default the processed network is saved in the same directory as the provided edge file.-t
Whether to save each single-edge sub-network. Default value is False. If set to True, then an edge file will be saved for each individual edge type in the original network file.-v
How much feedback to receive at the command line. Default value is 0, but can be set to 1 or 2.
An example command below explicitly specifies the location of the .keep.txt
file, a different output location, limiting meta-paths to lengths of no greater than 3 edges, saving .txt sub-network edge lists, and specifying verbose terminal output:
python3 MAPR_networkPrep.py ./demo_data/networks/demo_net.edge.txt \
-k ./demo_data/networks/demo_net.keep.txt -n demo_outputs/networks/ -l 3 -t TRUE -v 1
The warnings in this demo indicate meta-paths that fail to connect any nodes. This may happen when very sparse sub-networks are combined, and no nodes use edges from both edge types. Such meta-paths won't influence the final prediction.
More information about the output network and MetaPath files and formats can be found in the MAPR Resources file formats.
The second step of building meta-path related features for input gene sets are calculated by: MAPR_buildFeatures.py
.
The script requires 1 argument specifying the name of the network to use. It also will use a directory that contains the input gene set samples. By default, it assumes the processed network is stored in './networks/', the gene sets to be processed are in './gene_sets/', and the resulting output will be saved in './output', but these defaults can be overridden with optional flags.
The script will calculate meta-path connections for all gene sets in the samples directory for multiple cross-validation folds of the data. Information about the format and naming of the gene set files within the set directory are found in the MAPR Resources input formats section.
Optional arguments are specified with flags, and include:
-n
Directory where the processed network is stored. Default is './network'-s
Directory where one or more gene set text files are stored. Default is './samples'-o
Directory where the output will be saved. Default is './output'-f
The number of folds to use in cross-validation. Default value is 4.-v
How much feedback to receive at the command line. Default value is 0, but can be set to 1 or 2.
To continue our demo example (while specifying a network saved to an alternative location, the gene set input directory, 4 folds for cross-validation, an alternative output folder, and verbose terminal feedback), we can use the following command:
python3 MAPR_buildFeatures.py demo_net -n demo_outputs/networks/ \
-s demo_data/gene_sets/ -o demo_outputs/features/ -f 4 -v 1
More information about the output batch of feature files and formats in the printed to the batch_directory
can be found in the MAPR Resources file formats.
The final step of the learning an ensemble of classifiers for predicting and ranking genes for membership in the input sample gene sets is performed by MAPR_characterizeSet.py
The script requires the batch_directory
, an argument specifying the location of the meta-path features across the sample partitions that is the output by the MAPR_buildFeatures.py script. The MAPR_characterizeSet.py
script builds multiple models for the processed gene sets to learn how the genes are interconnected and creates a ranked list of all genes by patterns of connectedness. This is done both for the full gene set and each of several fold partitions. The Area Under the Receiver Operating Characteristic Curve (AUC) is reported for each fold to report how well the learned models predict the original gene set -- or how well they fit the data.
Optional arguments for the MAPR_characterizeSet.py
script are specified with flags, and include:
-m
Number of models to create, affects stability of results. Default is 101.-l
Maximum meta-path length to consider. Default is 3.-i
Path and filename of a text file containing edge types to omit from consideration. Default is 'NONE'.-v
How much feedback to receive at the command line. Default value is 0, but can be set to 1 or 2.
An example of this script specifying a maximum considered meta-path length of 3, the creation of only 31 models per set, and verbose terminal feedback is below.
python3 MAPR_characterizeSet.py ./demo_outputs/features/batch-000/ -m 31 -l 3 -v 1
The final outputs of this MAPR tool are the ranked gene lists and the AUC performance files for each partition. More information about the files and their formats can be found in the MAPR Resources file formats.
The following sections have additional detail about the file formats produced and consumed by the MAPR scripts.
Before running MAPR, users will need to prepare the following files:
The network edge file specifies edges in the network and an example can be found at ./demo_data/networks/demo_net.edge.txt
. Each line has four columns specifying an origin node, a terminal node, a weight, and an edge type. In the example below, three edge types are shown: GO_MolFunc, pfam_domain, and String_coexpression. Gene names begin with "ENSG", while gene ontology terms begin with "GO_" and protein families with "pfam_".
ENSG00000004779 ENSG00000006715 681 STRING_coexpression
ENSG00000004779 ENSG00000005249 537 STRING_coexpression
ENSG00000004779 ENSG00000007923 681 STRING_coexpression
ENSG00000004779 ENSG00000008018 647 STRING_coexpression
GO_0001085 ENSG00000005339 2 GO_MolFunc
GO_0001085 ENSG00000004487 1 GO_MolFunc
GO_0001102 ENSG00000005339 2 GO_MolFunc
Pdom_Ank ENSG00000001629 7.39 pfam_domain
Pdom_Ank ENSG00000001631 7.25 pfam_domain
Pdom_SH3_1 ENSG00000007264 5.33 pfam_domain
Pdom_SH3_1 ENSG00000008735 4.85 pfam_domain
Pdom_SH3_2 ENSG00000000938 6.73 pfam_domain
Pdom_SH3_2 ENSG00000002834 5.25 pfam_domain
The STRING_coexpression edges are direct gene to gene edges. As such, the first two columns contain gene names. Both GO_MolFunc and pfam_domain edge types group genes into shared terms. For those, the first column contains the term (eg. Pdom_Ank) while the second column is one of the genes in that term. The third column scores the strength of that particular edge, or connection. Note that the way scores are calculated and defined may vary greatly from one edge type to another. MAPR will normalize the scores for each edge type to allow more equivalent comparison of weights across types.
The keep file tells MAPR how to distinguish between gene and term nodes, as well as indicate which edges to keep or discard when processing the network. An example can be found at ./demo_data/networks/demo_net.keep.txt
. The keep file consists of three sections, indicated by the required headers "GENE TYPES", "EDGE TYPES", and "CUTOFF RANGE". Here is an example keep file excerpt with the three sections and the columns within each section shown:
GENE TYPES
human1 ENSG keep
human2 LRG_ no
EDGE TYPES
GO_MolFunc indirect keep
kegg_pathway indirect no
pfam_domain indirect keep
STRING_coexpression direct no
CUTOFF RANGE
GO_MolFunc 10 3000
kegg_pathway 10 3000
pfam_domain 10 3000
Under "GENE TYPES" are three columns indicating species, the pattern of characters occurring at the start of each gene of that type (for use by regex), and the indicator whether to keep or discard the genes in that row. Setting the indicator column to "keep" or "yes" will keep genes of that type, while "no" will discard them.
Note that genes must start with some identifying set of characters. In the provided example network, all genes used either the Ensemble Gene ID or the Locus Reference Genomic identifier. Therefore, every gene name began with either "ENSG" or "LRG". If you use gene names with much wider variation, you may wish to use a prefix such as "gname_" to distinguish genes from non-gene terms.
The EDGE TYPES section has three columns: the name of the edge type, whether it is "direct" (gene-to-gene connection) or "indirect" (genes belong to a shared term), and an indicator whether to keep that edge type (ie: "keep" or "yes").
The CUTOFF RANGE section is specific to indirect edge types. The three columns there are: edge type, minimum considered term size, and maximum considered term size. For instance, the file example_alternate.keep.txt has the row "GO_MolFunc 10 3000", indicating that all terms containing less than 10 genes or more than 3000 will be ignored. One may wish to do this to filter out terms that are considered either too noisy and subjective, or too broad to be useful.
By default, the MAPR scripts will look for a keep file with the same name as the edge file. For example, if the provided edge file is named "network_name.edge.txt", the script will look for "network_name.keep.txt". This can be changed using the '-k' command-line argument.
Gene sets should be provided as text files (examples shown in ./demo_data/gene_sets/). They may be given as one or two files. In the case that up- and down-regulated genes are provided as two separate files to be treated as one gene set, the files should begin with the same name and end with either '_DN.txt' or '_UP.txt'.
Genes need not be sorted. The script will remove any redundant entries, and ignore genes that don't appear in the network.
An example of the first several lines in a sample gene set:
ENSG00000007047
ENSG00000008294
ENSG00000002919
ENSG00000005206
ENSG00000002726
ENSG00000006534
ENSG00000006712
...
There are many outputs that will be produced when MAPR_networkPrep.py
is run. These will be stored in the location of the input network unless the -n
flag is used.
- Processed Network Files will appear together in the output directory that is the
network_name
- edges.txt - complete list of valid edge types in the processed network
- genes.txt - complete list of valid gene names in the processed network
- indices.txt - map from gene name to processed edge index in "network.txt" output
- keep.txt - processed ".keep.txt"-format input file
- network.txt - processed ".edge.txt"-format input file
- node-degree.txt - gene by edge type matrix with each entry being the node degree at that gene node for edges of that type
- For each
edge_type
- [network_name]-[edge_type].txt - list of processed network edges with weights for the specified
edge_type
- [network_name]-[edge_type].txt - list of processed network edges with weights for the specified
- MetaPath Files will appear together in the output directory that is named
[network_name]_MetaPaths
- key.txt - mapping between id for each meta path type,
meta_path_index
, and the human-readable description of the meta path - key_primaries.txt - mapping of ids for all the meta paths of length 1
- key_termNonGene.txt - mapping of ids for all the meta paths of length 1 that were collapsed to direct gene-gene edges from indirect property-gene edges
- For each
meta_path_index
that is an identifier for a meta path type- [meta_path_index].gz - zipped matrix where the rows and columns are ordered by genes.txt and the values are the score for the meta_path_index type of the file
- [meta_path_index]tl.txt - list of all property labels for the indirect property-gene edge type
- [meta_path_index]tm.gz - the transpose of [meta_path_index].gz, used for indirect property-gene edge types
- key.txt - mapping between id for each meta path type,
There are many outputs that will be produced when MAPR_buildFeatures.py
is run. These will be stored in the ./outputs/
unless the -o
flag is used.
For each run of MAPR_buildFeatures.py (i.e. batch-000
), a set of directories will be generated for each sample. These directories will either correspond to the full
gene set of that sample or a subsampled part
of that sample's genes. In the case of 4-fold cross validation, there will be part-
subdirectories ending in 00
, 01
, 02
, and 03
. For this script, the outputs are:
- parameters.txt - a two column file that records the parameter and their values for this run
- For each
[sample]
's input gene set and eachfull-[sample]
orpart-[sample]-[XX]
partition:- known.txt - list of genes from that sample's input gene set that the algorithm is given for training. This will be the entire matched gene list for
full
partitions. - concealed.txt - list of genes from that sample's input gene list that the algorithm does not see in training. This will be 1/n of the matched gene list for
part
partitions with n-fold cross-validation - ignored.txt - list of genes from that sample's input gene set that did not match any genes in the network
- features_ZscoreSim.gz - a zipped genes by meta path type matrix where the row order matches genes.txt, the column order matches the meta path type index (key.txt), and the values reflect the transformed feature score for that gene for that meta path type
- known.txt - list of genes from that sample's input gene set that the algorithm is given for training. This will be the entire matched gene list for
There are many outputs that will be produced when MAPR_characterizeSet.py
is run for a particular model configuration, model_description
. These will be stored in the batch_directory
alongside the outputs of the previous steps. For the MAPR_characterizeSet.py
script, the outputs are:
- classifier-[model_description].txt - a description of the model and model parameters values
- results-AUC_mean_by_set.txt - for each
model_description
, the average AUC for each input gene set sample across the partition folds - results-AUC_every_fold.txt - for each cross validation partition and each model description, the AUC for the ranked list of left out genes on that input gene set partition.
- For each
[sample]
's input gene set and eachfull-[sample]
orpart-[sample]-[XX]
partition:- parameters-[model_description].txt - a description of the given model and model parameters values for the ensemble learned on that given partition
- ranked_all_genes-[model_description]_Avg.txt - rows for every gene in the network with columns for 1) weighted combination across models of z-scores of the predictions, 2) the gene name, and both columns 3 and 4) the gene rank for relatedness to input gene set
- ranked_genes-[model_description]_Avg.txt - rows for every gene in the network not in known.txt with columns for 1) weighted combination models of prediction scores normalized between 0 and 1, 2) the gene name, 3) the gene rank for relatedness to input gene set, and 4) the weighted combination of model prediction ranks
- scored_features-[model_description].txt - a meta-path-type by ensemble-model matrix with the [-1,1] normalized meta path type weights for each model. The first three columns show 1) the average meta path type weight across models, 2) the length of the meta path type, and 3) the name of the meta path type. The first row lists the combination weights for each model.
In the final case study of the MAPR paper, we examine the genes associated with epithelial-to-mesenchymal transition (EMT). The network and sample gene sets associated with this analysis are available to download as a zipped archive from here.
This work was supported by the National Institutes of Health [grant 1U54GM114838 awarded by NIGMS] through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative.