diff --git a/data_integration_and_mining_with_graphs-dynamic_graphs.html b/data_integration_and_mining_with_graphs-dynamic_graphs.html
new file mode 100644
index 0000000..7da1657
--- /dev/null
+++ b/data_integration_and_mining_with_graphs-dynamic_graphs.html
@@ -0,0 +1,216 @@

EMBL Centre for Network Analysis workshop 9-11 December 2015

 

 

Data integration and mining with graphs:

 

Tensor factorizations for dynamic graphs

 

 

Jean-Karim Hériché

Ellenberg lab

Cell Biology and Biophysics unit


Summary

 

 As gene networks become available for multiple time points and experimental conditions, there's a need for principled ways of detecting patterns and their evolution in such dynamic graphs. We will introduce a tensor representation of dynamic graphs and tensor factorization methods as ways of investigating patterns in these graphs.

 

 

Motivation

 Many biological processes are driven by systems that evolve over time. New technologies make the dynamics of such systems more accessible, with increasingly high time resolution. In particular, data on gene networks can be acquired for multiple time points and experimental conditions. Detecting patterns, and how they evolve over time and/or across experimental conditions, in such multidimensional data is becoming a common task. For example, one may have performed AP-MS experiments on several proteins at different cell cycle or developmental stages and be interested in identifying protein complexes and how they change between time points.

 

Dynamic graphs

 - What is a dynamic graph?

        A dynamic graph can be seen as multiple instances of a graph whose edges change from one instance to the next. This definition includes time-varying graphs, whose edges change as a function of time, but also multigraphs (graphs in which nodes are connected by edges of different types) if each edge type is segregated into a separate instance of the graph. Examples of dynamic graphs are protein interaction networks and gene co-expression networks obtained at different cell cycle or developmental stages or under different experimental conditions (e.g. control vs treatment vs disease). With this definition, a dynamic graph can therefore be represented as a series of adjacency matrices.

 

 - Different approaches to mining dynamic graphs

        It is common to treat dynamic graphs in one of three ways: collapsing all instances into a single aggregate graph, mining each instance independently and matching patterns across instances, or incrementally updating results from one instance to the next.

 

All of the above treat time or experimental conditions separately from the nodes. A global approach treating time/conditions and nodes simultaneously would better capture evolving patterns. One of the first such approaches extended the modularity coefficient [1] to multiple instances of a graph [2,3] by defining links between graph instances. Here, we'll introduce another global approach, which treats genes, time and conditions as multiple dimensions of the data.

 

Tensors

 - What is a tensor?

  There are different definitions of tensors in different fields, but they all come down to the same concept: tensors are a generalization of scalars, vectors and matrices. A scalar is an order-zero tensor, a vector an order-one tensor, a matrix an order-two tensor, and so on. For most practical purposes, a tensor is therefore a multidimensional array. It is then easy to see that a dynamic graph represented as a series of adjacency matrices is a third-order tensor.

 

 - Terminology and notation:

 Although the notation and terminology are not fully standardized, many papers now follow the conventions in [4] which we also adopt here. A tensor will be denoted using underlined bold face e.g. X to distinguish it from the matrix X.

 The order of a tensor is its number of dimensions also called modes (or sometimes ways). The equivalent of matrix rows and columns are fibers and when used as vectors, they are assumed to be column vectors. Slices are 2D sections of a tensor. See Figure 1 for a graphical representation of these terms for third-order tensors.

Figure 1: Tensor terminology
 
 

 - Working with tensors:

 Most of the tasks we're interested in rely on transforming tensors into matrices using a process called matricization or unfolding. Matricization consists in rearranging the elements of the tensor into a matrix. The most common way of doing this is mode-n matricization, which takes the mode-n fibers of the tensor and arranges them as columns of a matrix (Figure 2). The mode-n unfolding of Z is denoted Z(n). The order of the columns is not important as long as it is consistent across calculations. Sometimes it can be useful to represent a tensor as a vector, in a process called vectorization. The ordering of the elements in the vector is likewise not important as long as it is consistent.

 Although tensors can be multiplied together, we will only need the mode-n multiplication of a tensor with a matrix, in which the mode-n fibers of the tensor are multiplied by the matrix. This is written Z = Y ×n V and can be expressed in matricized form: Z(n) = VY(n). See Figure 3 for an intuitive graphical example.

        Working with matricized tensors also makes use of various matrix products: the Kronecker product, the Khatri-Rao product and the Hadamard product. See sections 1 and 2 of [5] and chapter 1.4 of [6] for a more detailed introduction to tensor algebra.
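
These operations are simple to sketch in base R; the following toy example (array sizes and variable names are made up, not from the original notes) exploits R's column-major array storage:

## Tensor operations in base R (toy example)
I <- 3; J <- 4; K <- 2
X <- array(1:(I*J*K), dim=c(I,J,K)) # a third-order tensor as a multidimensional array
X1 <- matrix(X, nrow=I) # mode-1 unfolding: mode-1 fibers as columns
X2 <- matrix(aperm(X, c(2,1,3)), nrow=J) # mode-2 unfolding
X3 <- matrix(aperm(X, c(3,1,2)), nrow=K) # mode-3 unfolding
x <- as.vector(X) # vectorization
V <- matrix(runif(2*I), nrow=2) # a 2 x I matrix
Z <- array(V %*% X1, dim=c(2,J,K)) # mode-1 multiplication Z = X x1 V, i.e. Z(1) = VX(1) folded back
KP <- kronecker(V, V) # Kronecker product
khatri_rao <- function(A,B) sapply(1:ncol(A), function(j) kronecker(A[,j], B[,j])) # column-wise Kronecker product
HP <- V * V # Hadamard (element-wise) product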

Figure 2: Matricization of a tensor
 
 
Figure 3: Mode-n multiplication with a matrix
 
 

Graph node clustering by matrix factorization

        Nodes of static graphs can be clustered by applying clustering methods to the adjacency matrix. Among these methods, matrix factorization approaches have already been successfully applied to some computational biology questions (see [7] for a review of non-negative matrix factorization (NMF) in computational biology and [8,9] for examples of spectral clustering). Spectral clustering and NMF solve a problem of the form A = W * H + E. A more general form of factorization is the singular value decomposition (SVD), A = U * S * V^T, where the columns of U and V are constrained to be orthogonal. SVD corresponds to the decomposition of the matrix into a sum of outer products of vectors (Figure 4).
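
A small base-R illustration of this outer-product view (the matrix is a made-up example):

## SVD as a sum of outer products (toy example)
A <- matrix(c(0,1,1, 1,0,2, 1,2,0), nrow=3) # small symmetric adjacency matrix
s <- svd(A) # A = U S V^T with S = diag(s$d)
A_rec <- Reduce(`+`, lapply(1:3, function(r) s$d[r] * s$u[,r] %*% t(s$v[,r]))) # sum of outer products
all.equal(A, A_rec) # TRUE up to numerical accuracy
A_rank1 <- s$d[1] * s$u[,1] %*% t(s$v[,1]) # keeping only the largest singular value gives the best rank-1 approximation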

Figure 4: Singular value decomposition
 
 

Tensor factorization

 Tensor factorization can be seen as a natural extension of SVD to an arbitrary number of dimensions (Figure 5).

Figure 5: A higher order SVD
 
The general form is the Tucker decomposition which factorizes a tensor into a small core tensor and a set of matrices (See Figure 6). See [5] for a general introduction and [6] for non-negative factorizations.
 

        The Tucker decomposition can be written in terms of matrix unfolding of the tensor X (of size I x J x K):

X(1) ≈ A G(1) (C ⊗ B)^T

X(2) ≈ B G(2) (C ⊗ A)^T

X(3) ≈ C G(3) (B ⊗ A)^T

where ⊗ is the Kronecker product and A, B and C are factor matrices of size I x R, J x S and K x T respectively, R, S and T being the numbers of factors desired for each dimension. The factor matrices are usually constrained to have orthogonal columns.

Like PCA and other matrix decompositions, the Tucker decomposition is not unique: the core can be modified without affecting the quality of the solution, as long as the inverse modification is applied to the factor matrices. Restricting the core tensor to be diagonal gives the CP (or PARAFAC) decomposition (Figure 7, which is equivalent to Figure 5), in which the same R factors are shared across the matrices. The CP decomposition can also be expressed in terms of matrices:

X(1) ≈ A (C ⊙ B)^T

X(2) ≈ B (C ⊙ A)^T

X(3) ≈ C (B ⊙ A)^T

where ⊙ is the Khatri-Rao product. A form of non-negative tensor factorization is obtained by constraining A, B and C to have non-negative entries.
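
Both decompositions, in their unconstrained form, are available in the rTensor package used in the example below; here is a minimal sketch on a random tensor with arbitrary sizes and ranks:

## Tucker and CP decompositions with rTensor (random toy tensor)
library(rTensor)
X <- as.tensor(array(runif(10*10*4), dim=c(10,10,4)))
tk <- tucker(X, ranks=c(3,3,2)) # core tensor tk$Z of size 3 x 3 x 2, factor matrices in the list tk$U
cpX <- cp(X, num_components=3) # CP: 3 shared factors, factor matrices in cpX$U
cpX$lambdas # weights of the rank-one components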

Figure 6: Tucker decomposition of a third order tensor
 
 
Figure 7: The CP decomposition
 
 

Non-negative tensor factorization of dynamic graphs

 This can be thought of as an extension of spectral clustering to more than one adjacency matrix. Our dynamic graph is represented by a third-order tensor G of genes x genes x time points (or experimental conditions) in which each frontal slice is the genes x genes adjacency matrix of a weighted undirected graph. The edge weights are also assumed to be non-negative (≥ 0). Using a non-negative CP decomposition, we can factorize G into non-negative factor matrices A, B and C. Because the graph is undirected, the adjacency matrices are symmetric, which leads to A = B. A is a genes x factors matrix and C is a time points x factors matrix. Therefore, if we think of the factors as representing clusters, elements of A can be interpreted as gene cluster memberships and elements of C as measuring the temporal activity of the clusters.

 

Tools

- MATLAB toolboxes:  tensor, nway (compatible with Octave)

- R packages*: PTAk, ThreeWay, rTensor

- perl:  Algorithms::Cube (https://github.com/jkh1/Perlstuff)

- python*: scikit-tensor (https://github.com/mnick/scikit-tensor)

 

* non-negativity constraint is not available in R and python packages (December 2015).

 

Example in R

## Load required package

library(rTensor)

 

## Where the data is

## This is a simulated dynamic graph with 30 nodes and 6 time points

dataDir <- "~/Documents/Bio-IT/courses/Data_integration_and_mining_with_graphs/dynamic_graph"

 

## Get list of adjacency matrices

## Each matrix is in a csv file

fileList <- dir(path=dataDir, pattern="\\.csv$") # keep only files ending in .csv

 

## Read all adjacency matrices into an array

A <- array(as.numeric(NA),dim=c(30,30,6)) # There are 6 matrices of size 30 x 30

for (i in 1:length(fileList)){

    A[,,i] <- as.matrix(read.delim(file.path(dataDir,fileList[i]), sep = ';', header=TRUE, row.names=1))

}

 

## Convert list of adjacency matrices to a tensor

G <- as.tensor(A)

 

## Perform CP factorization

cpG <- cp(G,num_components = 4)

## View the factor matrices

## Note that the first two are identical (up to numerical accuracy)

## due to the fact that the graph adjacency matrices are symmetric

cpG$U
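
The factor matrices can then be interpreted directly. A minimal sketch (variable names are ours): the first factor matrix holds gene cluster memberships and the third the temporal activity of each factor.

## Interpret the factors (sketch based on the cpG object above)
Afac <- cpG$U[[1]] # 30 genes x 4 factors: gene cluster memberships
Cfac <- cpG$U[[3]] # 6 time points x 4 factors: temporal activity of the clusters
## Plot the activity of each factor over time
matplot(Cfac, type="l", lty=1, xlab="time point", ylab="factor activity")
## Indices of the genes most associated with factor 1
head(order(abs(Afac[,1]), decreasing=TRUE))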

 

 

 

Example in perl

#!/usr/bin/perl

use strict;

use warnings;

use Algorithms::Matrix;

use Algorithms::Cube;

 

# Where the data is

my $dataDir = "$ENV{'HOME'}/Documents/Bio-IT/courses/Data_integration_and_mining_with_graphs/dynamic_graph";

 

# Number of time points

my $T = 6;

 

# Read dynamic graph from files as series of adjacency matrices

my @G;

foreach my $t(0..$T-1) {

  my ($rowLabels,$colLabels);

  ($G[$t], $rowLabels, $colLabels) =

 Algorithms::Matrix->load_matrix("$dataDir/A$t.csv",";",1,1);

}

# Create Cube (i.e. third order tensor) object from matrices

my $G = Algorithms::Cube->new(@G);

 

# CP factorization with non-negativity constraint

my $r = 4; # Number of components

my ($A,$B,$C,@norms) = $G->nncp($r,symmetry=>1,norms=>1);

 

 

References:

1. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.

2. Mucha PJ, Richardson T, Macon K, Porter MA, Onnela JP (2010) Community structure in time-dependent, multiscale, and multiplex networks. Science 328: 876–878.

3. Bassett DS, Porter MA, Wymbs NF, Grafton ST, Carlson JM, et al. (2013) Robust detection of dynamic community structure in networks. Chaos 23: 013142.

4. Kiers HAL (2000) Towards a standardized notation and terminology in multiway analysis. J. Chemometrics 14: 105-122. http://www.wiley.com/legacy/wileychi/chemometrics/582_ftp.pdf

5. Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev., 51(3), 455–500.

6. Cichocki A, Zdunek R, Phan AH, Amari SI (2009) Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons.

7. Devarajan K (2008) Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Comput Biol 4(7):e1000029.

8. Kluger Y, Basri R, Chang JT, et al. (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res. 13(4):703–16.

9. Imangaliyev S, Keijser B, Crielaard W, Tsivtsivadze E (2015) Personalized microbial network inference via co-regularized spectral clustering. Methods 83:28-35.

diff --git a/data_integration_and_mining_with_graphs-kernels.html b/data_integration_and_mining_with_graphs-kernels.html
new file mode 100644
index 0000000..f2f68d8
--- /dev/null
+++ b/data_integration_and_mining_with_graphs-kernels.html
@@ -0,0 +1,214 @@

EMBL Centre for Network Analysis workshop 9-11 December 2015

 

 

Data integration and mining with graphs:

 

Kernels on graph nodes

 

 

Jean-Karim Hériché

Ellenberg lab

Cell Biology and Biophysics unit


Summary

 

 Biological data sets can be represented as graphs for the purpose of data integration and mining. We will introduce kernels on graph nodes as measures of similarity between genes, show how to compute some kernels for various biological data sources, and show how they can be used for data integration. We will apply the kernels to the task of candidate gene selection.



Motivation

 

 While large, nearly genome-wide studies are now possible, they remain beyond the reach of most laboratories as they require a large investment in reagents, equipment and data management. In most cases, locally available resources only allow a few dozen to a few hundred genes to be studied, with candidate genes usually selected either from a gene family (e.g. kinases) or after extensive literature and database searches. In the latter case, candidate genes are identified by association with genes already known to be implicated in the process under consideration. Given that links between genes can be indirect and that there are multiple sources of data, this selection process has become a difficult task. In this context, efficient automatic candidate gene selection has the potential to maximize the return on investment in experiments.

        Selecting good candidate genes means finding genes that have a high probability of being part of, or related to, the process one wants to study. If this process is defined in terms of a few genes known to participate in it, then good candidate genes are those that are functionally linked to these known genes. Compared to large-scale, systematic experiments, finding new candidates based on network context analysis has the added advantage of providing hypotheses about likely underlying molecular interaction mechanisms involving the predicted targets.

        Many methods using the "guilt by association" principle have been developed for predicting gene function and applied to model organisms (for a review see [1]). Among these, kernel-based methods have gained in popularity [2,3]. Yet despite representing a natural measure of similarity between genes, kernels on graph nodes have received little attention in this context.

 

Kernels

        Kernels and kernel methods are well described and explained in [4] which is suggested reading for anyone interested in the kernel framework.

 

- What is a kernel?

        A kernel is a function that maps data points (e.g. genes) into some high-dimensional space (called the feature space) and computes the dot product of the corresponding vectors in that space. That is, for a mapping function Φ and two genes x and y, the kernel function K computes K(x,y) = ⟨Φ(x),Φ(y)⟩. A kernel matrix contains the evaluation of a kernel function for all pairs of data points under consideration. Note that we often use the term kernel to refer to the kernel matrix instead of the function.
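
As a toy illustration (a standard textbook example, not from the original notes), the homogeneous polynomial kernel K(x,y) = ⟨x,y⟩^2 on 2D points corresponds to the explicit feature map Φ(x) = (x1^2, x2^2, √2 x1x2):

## A kernel as a dot product in feature space (toy example)
Phi <- function(x) c(x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2]) # explicit feature map
x <- c(1,2); y <- c(3,1)
sum(x*y)^2 # kernel computed directly: (x.y)^2 = 25
sum(Phi(x)*Phi(y)) # same value as a dot product in feature space: 25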

 

-  Useful properties of kernels:

        - A symmetric positive semi-definite matrix is a valid kernel matrix.

        - A linear combination of kernels with non-negative coefficients is itself a valid kernel.

Proofs of these properties can be found in chapters 2 and 3 of [4].

 

        - Why use kernels?

        Many algorithms use only the dot products between input vectors, e.g. PCA, LDA, k-means and other clustering algorithms, support vector machines... The kernel trick often mentioned in the literature refers to using kernel matrices to provide these dot products. This provides modularity, as any kernel can work with any dot product-based algorithm. Kernels can be used to map complex patterns into a feature space where they appear linear, allowing them to be efficiently identified using common algorithms. The last property listed above means that a combination of similarity matrices can still be interpreted as a matrix of similarities if the combined matrices are valid kernels. This property makes kernels particularly attractive for data integration because it provides a principled way of combining information from different sources into one similarity matrix.
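
For example, given kernels K1 and K2 computed from two data sources on the same set of genes (toy matrices here, not real data), integration is a weighted sum:

## Data integration by combining kernels (toy example)
K1 <- diag(3) # kernel from data source 1
K2 <- matrix(c(1,0.5,0, 0.5,1,0.2, 0,0.2,1), nrow=3) # kernel from data source 2
K <- 0.7*K1 + 0.3*K2 # a combination with non-negative weights is still a valid kernel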

 

Measures of graph node similarity

        A good measure of similarity between nodes of a graph should satisfy the guilt-by-association intuition that two nodes are more similar to each other when they are connected to similar nodes. Our intuition also tells us that nodes are more similar when there are more connections between them and the paths connecting them are shorter. Therefore, to measure similarity between nodes, we want to take into account how they are connected. Some simple measures would involve counting the number of hops or summing the weights of the hops between two nodes but it's easy to find examples where these measures don't capture meaningful similarity.

 

        One way to analyse connectivity in a graph is to select a starting node, move to a random neighbour of this node, then to a random neighbour of that node, and so on for an infinite number of steps. The sequence of randomly visited nodes in this process is called a random walk on the graph and can be viewed as a Markov chain. A short simulation can show that a random walk hits closer, more connected nodes more often than more distant, less connected ones; seen another way, it takes less time to reach a node from closer, more connected nodes than from more distant, less connected ones.
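
Such a simulation takes a few lines of R (the graph here is a made-up example):

## A short random walk simulation on a toy weighted graph
set.seed(1)
A <- matrix(c(0,2,1,0, 2,0,1,0, 1,1,0,1, 0,0,1,0), nrow=4) # node 4 is peripheral
visits <- numeric(nrow(A))
current <- 1
for (step in 1:10000) {
  current <- sample(1:nrow(A), 1, prob=A[current,]) # move to a neighbour with probability proportional to edge weight
  visits[current] <- visits[current] + 1
}
visits/sum(visits) # closer, better connected nodes are visited more often than node 4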

 

        The expected time (number of steps) it takes a random walk to reach a node starting from a different node is called the hitting time. Although the hitting time is a distance (similar nodes have a small hitting time and dissimilar nodes have a large one), it clearly captures the notion of similarity we outlined above. However, it is not symmetric, as H(i,j) is generally not equal to H(j,i), but it can be used to derive the commute time between i and j, which measures how long it takes to go from i to j and back to i: CT(i,j) = H(i,j) + H(j,i) = CT(j,i), and which does capture the same notion of similarity. Other similarity measures can be derived from the random walk. One is called random forest (RF) because it arises from the enumeration of the spanning rooted forests in the graph and measures the relative "forest-accessibility" between nodes [5]; RF(i,j) corresponds to the probability of reaching node j in a random number of steps starting from node i [6]. Finally, if instead of jumping from node to node the random walk moves in a continuous fashion along the edges, we simulate a diffusion process. Measuring similarity in a diffusion process involves enumerating all paths between two nodes while penalizing the longer paths [4].

 

Computing kernels on graph nodes

        Each data set is viewed as the adjacency matrix A of a weighted undirected graph, a value Aij of 0 indicating no edge between i and j. In the following, D denotes the diagonal degree matrix, I the identity matrix and L the graph Laplacian (L = D - A). We will compute the following kernels:

 

 - kernelized adjacency matrix:

 Our data sources already represent a measure of similarity between genes. To ensure that the matrix is positive semi-definite, and therefore a valid kernel, we can shift its eigenvalues by adding a sufficiently large constant to the diagonal. Here we use the absolute value of the smallest eigenvalue:

        KA = A + λI with λ = |smallest eigenvalue of A|

 

 - Commute time kernel:

 The commute time between two nodes can be computed using L+, the pseudo-inverse of the Laplacian [7]. The commute time itself is a distance measure, but it can be shown that L+ is the covariance matrix associated with the nodes in a feature space where the nodes are separated by their commute time [7]. Therefore (and also because L is symmetric positive semi-definite), L+ is symmetric positive semi-definite and a valid kernel, called the commute time kernel. It also has an interpretation in terms of electrical networks, the commute time being proportional to the effective resistance between two nodes [8]. Note that for very large graphs the commute time distance becomes meaningless [9], but this can be alleviated by normalization (see below).

        KCT = L+ (pseudo-inverse of L)

 

 - Random forest kernel:

 This kernel arises from the enumeration of the spanning rooted forests in the graph, measures the relative "forest-accessibility" between nodes, and has an interpretation in terms of the probability of reaching a node in a random walk with a random number of steps. Its derivation is explained in [5,6].

        KRF = (I + L)^-1

 

 - von Neumann diffusion kernel:

 This kernel enumerates all paths between two nodes while penalizing the longer paths [4]:

        KVN = Σk γ^k A^k = (I - γA)^-1

        The penalizing factor is γ^k (k being the length of the path) and the kernel is defined for 0 < γ < ρ^-1, with ρ being the spectral radius of A. We will use two values of γ: γ = 0.99ρ^-1 and γ = 0.0001ρ^-1, which correspond respectively to high and low values of the admissible range for γ. Note that in a supervised setting with enough training data, one can learn a value for γ from the data.

 

        - Degree-based similarity:

 We also compute a similarity matrix based only on node degree as in [10]:

        KDB = BB^T where B = De and e is the all-one vector (e^T = (1,1,...,1)).

KDB(i,j) is simply the product of the degree of node i and the degree of node j. Note that this is a valid kernel by construction.

 

        - Normalization:

        Genes with multiple functions, or genes that are more studied, tend to have more links to other genes. To compensate for this effect, each kernel is normalized to the range [-1,1] as diag(K)^-1/2 K diag(K)^-1/2, which corresponds to computing the cosine between the vectors in the feature space of the kernel. This also seems to alleviate the commute time issue with large graphs [9].

 

Other kernels and similarity measures can be found in [11].

 

Application to finding functionally-related genes

        The approach we will follow is that of semi-supervised learning by label propagation. We assume that similar genes have similar function and we can propagate function from genes with known function to similar genes for which the function is unknown. The data processing pipeline is shown in Figure 1.

 
 

Practical

For this session, we're going to work with the text mining data (TM) using R.

First we need to load the libraries we need:

## Load required packages

library(igraph) # to manipulate graphs

library(MASS) # for pseudoinverse function ginv()

library(rARPACK) # for eigs(), to compute limited number of eigenvalues

We're also going to put the data directory in a variable so that we don't need to type it in each time:

## Where the data is

dataDir <- "~/Documents/Bio-IT/courses/Data_integration_and_mining_with_graphs"

Now we read the TM data from the file. The file contains a tab-delimited list of gene pairs with their similarities.

## Read data as list of tab-delimited pairwise similarities

## This is equivalent to a list of weighted graph edges

TM <- read.delim(file.path(dataDir,"TM.txt"), header=FALSE)

We name the variables for clarity and convenience (it's easier to understand what the variable weight means than what V3 means):

## Name variables for convenience (otherwise, by default, weights are in variable named V3)

colnames(TM) <- c("geneA","geneB","weight")

We convert the data to a graph and get its adjacency matrix:

## Form undirected graph from edge list

G <- graph.data.frame(TM,directed=FALSE)

## Get adjacency matrix

A <- as_adjacency_matrix(G,type="both",names=TRUE,sparse=FALSE,attr="weight")

 

We're now ready to play with the data:

1 - Is the original TM matrix a valid kernel? If not, how would you turn it into one?

As we've seen, a sufficient condition is that the matrix is symmetric and positive semi-definite:

## Check if A is a valid kernel i.e. symmetric positive semi-definite

isSymmetric(A) # is it symmetric?

Eig <- eigs(A,1,which="SR") # compute the smallest eigenvalue and corresponding eigenvector

Eig$values # smallest eigenvalue

The TM adjacency matrix is symmetric but not positive semi-definite because its smallest eigenvalue is negative. To make it positive semi-definite, we simply add the absolute value of the smallest eigenvalue to the diagonal:

## Make symmetric adjacency matrix a valid kernel

I <- diag(nrow(A)) # Identity matrix of the same size as A

A <- A + I*abs(Eig$values) # Now smallest eigenvalue of A is 0, A is a valid kernel

 

2- Compute some kernels for the data including the degree-based similarity and normalize the matrices:

Let's start with the commute time kernel:

## Get Laplacian matrix

L <- laplacian_matrix(G,normalized=FALSE,sparse=FALSE) # weights are set to the weight attribute by default

## Commute time kernel

CT <- ginv(L) # Matrix pseudo-inverse

## Normalize kernel

D <- diag(1./sqrt(diag(CT)))

CT <- D %*% CT %*% D

Now for the degree-based kernel:

## Degree-based kernel

B <- as.matrix(rowSums(A)) # column vector

DB <- B %*% t(B)

## Normalize kernel

D <- diag(1./sqrt(diag(DB)))

DB <- D %*% DB %*% D

And the diffusion kernel with two different values of its parameter:

## von Neumann diffusion kernel

## using different parameter values between 0 and 1/rho where rho = spectral radius of A

rhoA <- eigs(A,1,which="LM")

gamma <- 0.99/rhoA$values # large value, make sure gamma < 1/rho so that matrix is invertible

VNmax <- solve(I - gamma*A) # Matrix inverse

## Normalize kernel

D <- diag(1./sqrt(diag(VNmax)))

VNmax <- D %*% VNmax %*% D

## VN kernel for a small parameter value

gamma <- 0.0001/rhoA$values # small value

VNmin <- solve(I - gamma*A)

D <- diag(1./sqrt(diag(VNmin)))

VNmin <- D %*% VNmin %*% D
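
The random forest kernel introduced earlier can be computed in the same way; with L and I already defined, it takes one matrix inversion (a sketch, not part of the original walkthrough code):

## Random forest kernel
RF <- solve(I + L) # KRF = (I+L)^-1
## Normalize kernel
D <- diag(1./sqrt(diag(RF)))
RF <- D %*% RF %*% D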

 

3- Using different kernels, retrieve the top 20 genes related to the ANAPC, a random list of genes and a mixed list of ANAPC and DNA replication genes.

First we name the kernel rows and columns with the corresponding gene symbols to allow convenient access to similarities by gene symbol:

## Name kernel rows and columns for later convenience

colnames(CT) <- colnames(L)

rownames(CT) <- rownames(L)

colnames(VNmax) <- colnames(L)

rownames(VNmax) <- rownames(L)

colnames(VNmin) <- colnames(L)

rownames(VNmin) <- rownames(L)

colnames(DB) <- colnames(L) # DB also needs column names so it can be queried by gene symbol
rownames(DB) <- rownames(L)

Now we read the list of ANAPC genes:

## Read list of query genes

query_ANAPC <- read.table(file.path(dataDir,"query_ANAPC.txt"), quote="\"", stringsAsFactors=FALSE) # stringsAsFactors=FALSE to keep gene symbols as strings

Query the CT kernel by extracting the columns corresponding to the query genes:

## Query CT kernel

CT_ANAPC<-CT[,query_ANAPC$V1]

If a query gene is not in the data, we get an error, so it is safer to go by column indices as in the following, where query genes absent from the data are simply ignored (this also works if the gene symbols had been read as factors rather than strings):

index <- colnames(CT) %in% query_ANAPC$V1

CT_ANAPC<-CT[,index]

Again, for convenience, we name the rows with the corresponding gene symbols:

## Name the rows

rownames(CT_ANAPC)<-rownames(CT)

We then sum the rows to get the sum of similarity of each gene to the query genes, sort from highest to lowest similarity and show the top 20:

## Sum the rows

CT_ANAPC <- rowSums(CT_ANAPC)

## Sort by decreasing values

CT_ANAPC <- sort(CT_ANAPC,decreasing=TRUE)

## Show the top 20

CT_ANAPC[1:20]

We can now proceed in the same way for the other kernels and query lists.

 

4- Inspect the results. Are they what you expected?

We can see that when the query consists of known functionally-related genes, the genes returned at the top of the list are other genes known to be involved in the same process as the query genes. When using random or mixed queries, the returned genes don't seem to make sense.

 

5- How would you define the most relevant genes?

Plot the ordered similarities to the query; most of the time the top couple of hundred genes are enough.

plot(CT_ANAPC[1:200], yaxt="n") # No ticks on y axis

axis(side = 2, at = seq(0, max(CT_ANAPC)+1, 0.1)) # Add ticks every 0.1 on y axis

Notice the elbow followed by a long flat tail of low values. Similarity values to the left of the elbow can be considered significantly higher than those expected from a random selection of genes.

 

6- How would you find which is the best kernel for this data set?

First, we need lists of known functionally-related genes. These could be obtained from various databases e.g. Panther, KEGG, Reactome... Note that the choice of reference database affects the definition of functional relationship.

Using these lists, we can take a cross-validation approach: for each list, take some genes as the query and find the ranks of the other genes from the list. We can define sensitivity as the fraction of these target genes that are in the top N%. If we assume that random genes are unlikely to be functionally related, we can estimate the false positive rate by measuring the fraction of target genes in the top N% using random lists of genes matched to those from the database. Plotting the average sensitivity and false positive rate across all gene lists for different rank thresholds allows the different kernels to be compared. Usually the false positive rate is acceptably low for most kernels, so the main difference between them comes from the difference in sensitivity.

The same approach can be used to show that data integration by combining kernels gives better results than individual kernels; a sketch of the procedure follows.
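
A minimal sketch of this evaluation, assuming a hypothetical list geneSets of character vectors of functionally-related gene symbols (the function and its parameters are illustrative, not from the original notes):

## Kernel evaluation by cross-validation (geneSets is hypothetical)
sensitivity <- function(K, genes, nQuery=5, topPct=0.01) {
  query <- sample(genes, nQuery) # genes used as the query
  targets <- setdiff(genes, query) # genes we try to recover
  score <- rowSums(K[, colnames(K) %in% query, drop=FALSE]) # similarity to the query
  score <- score[!(names(score) %in% query)] # rank only non-query genes
  topN <- names(sort(score, decreasing=TRUE))[1:ceiling(topPct*nrow(K))]
  mean(targets %in% topN) # fraction of target genes recovered
}
## Average sensitivity of the commute time kernel over all reference sets:
## mean(sapply(geneSets, function(g) sensitivity(CT, g)))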


References

1. Wang PI, Marcotte EM. It's the machine that matters: Predicting gene function and phenotype from protein networks. J Proteomics 2010;73:2277-89.

2. De Bie T, Tranchevent LC, van Oeffelen LM et al. Kernel-based data fusion for gene prioritization. Bioinformatics 2007;23:i125-32.

3. Qi Y, Suhail Y, Lin Y et al. Finding friends and enemies in an enemies-only network: A graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome Research 2008;18:1991-2004.
4. Shawe-Taylor J, Christianini N. Kernel methods for pattern analysis. Cambridge: Cambridge University Press, 2004.
5. Chebotarev P, Shamis E. The Matrix-Forest Theorem and Measuring Relations in Small Social Groups. Automation and Remote Control 1997;58:1505-1514.
6. Chebotarev P. Spanning forests and the golden ratio. Discrete Applied Mathematics 2008;156: 813-821.
7. Fouss F, Pirotte A, Renders JM et al. Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans Knowl Data Eng 2007;19:355-369.
8. Qiu HJ, Hancock ER. Clustering and embedding using commute times. IEEE Trans Pattern Anal Mach Intell 2007;29:1873–1890.
9. von Luxburg U, Radl A, Hein M. Getting lost in space: large sample analysis of the commute distance. In: Proceedings of the 23rd Neural Information Processing Systems conference (NIPS 2010), pp. 2622–2630.
10. Gillis J, Pavlidis P. The Impact of Multifunctional Genes on "Guilt by Association" Analysis. PLoS One 2011;6:e17258.
11. Fouss F, Francoisse K, Yen Y et al. An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Networks 2012; 31:53–72.

diff --git a/index.html b/index.html
index 20284de..d81e6e9 100644
--- a/index.html
+++ b/index.html
@@ -48,7 +48,7 @@
   • Lees JG, Hériché JK, Morilla I, Fernández JM, Adler P, Krallinger M, Vilo J, Valencia A, Ellenberg J, Ranea JA, Orengo C. FUN-L: Gene prioritization for RNAi screens. Bioinformatics. 2015 Feb 8. pii: btv073. PubMed PMID: 25667547
-  • Aleksic J, Alexa A, Attwood TK, Chue Hong N, Dahlö M, Davey R, Dinkel H, Förstner KU, Grigorov I, Hériché JK Lahti L, MacLean D, Markie ML, Molloy J, Schneider MV, Scott C, Smith-Unna R, Vieira BM; as part of the AllBio: Open Science & Reproducibility Best Practice Workshop. An open science peer review oath. Version 2. F1000Res. 2014 Nov 12 [revised 2015 Jan 9];3:271. doi: 10.12688/f1000research.5686.2. eCollection 2014. PubMed PMID: 25653839; PubMed Central PMCID: PMC4304228.
+  • Aleksic J, Alexa A, Attwood TK, Chue Hong N, Dahlö M, Davey R, Dinkel H, Förstner KU, Grigorov I, Hériché JK, Lahti L, MacLean D, Markie ML, Molloy J, Schneider MV, Scott C, Smith-Unna R, Vieira BM; as part of the AllBio: Open Science & Reproducibility Best Practice Workshop. An open science peer review oath. Version 2. F1000Res. 2014 Nov 12 [revised 2015 Jan 9];3:271. doi: 10.12688/f1000research.5686.2. eCollection 2014. PubMed PMID: 25653839; PubMed Central PMCID: PMC4304228.
   • Patwardhan A, Ashton A, Brandt R, Butcher S, Carzaniga R, Chiu W, Collinson L, Doux P, Duke E, Ellisman MH, Franken E, Grünewald K, Hériché JK, Koster A, Kühlbrandt W, Lagerstedt I, Larabell C, Lawson CL, Saibil HR, Sanz-García E, Subramaniam S, Verkade P, Swedlow JR, Kleywegt GJ. A 3D cellular context for the macromolecular world. Nat Struct Mol Biol. 2014 Oct 7;21(10):841-845. doi: 10.1038/nsmb.2897. PubMedID: 25289590
@@ -240,6 +240,14 @@

+
+    Notes from the EMBL workshop on data integration and mining with graphs:
+
+
     I have built and maintain the following projects web sites and databases:

@@ -247,7 +255,7 @@
-    I have been/still am involved in the following projects:
+    I have been involved in the following projects: