Experimental!!!
The R package `OverICA` performs overcomplete independent component analysis (ICA), meaning it estimates the relation between latent variables and observed variables in the data, allowing there to be more latent than observed variables. We observe $p$ variables collected in the matrix $\mathbf{X}$ ($n$ observations by $p$ variables), which are modeled as linear mixtures of $k$ latent variables, $\mathbf{x} = A\mathbf{s}$, where $A$ is a $p \times k$ loadings matrix and possibly $k > p$.
This code estimates $A$ by leveraging the differences between the observed generalized covariance matrices and generalized means in the data and the model-implied generalized covariance matrices and generalized means. This is based on ideas developed by Podosinnikova et al. (2019).
We assume:

- variables $x_i$ are non-Gaussian (skewed, kurtotic, sparse, etc.)
- variables $x_i$ are uncorrelated
- a single-layer neural network applied to a Gaussian variable can approximate the (moments of) the uncorrelated latent variables
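To make the data model and these assumptions concrete, here is a small, purely illustrative simulation (not part of the package) of the kind of data `OverICA` targets: $k = 8$ skewed, independent (hence uncorrelated) latent variables mixed into $p = 5$ observed variables by a loadings matrix $A$.

```r
# Illustrative simulation: p = 5 observed variables generated from k = 8
# independent, non-Gaussian (centered gamma) latent variables via x = A s
set.seed(123)
n <- 5000; p <- 5; k <- 8

S <- matrix(rgamma(n * k, shape = 2, rate = 2) - 1, n, k)  # skewed latents, mean 0
A <- matrix(rnorm(p * k), p, k)                            # p x k loadings matrix
X <- S %*% t(A)                                            # n x p observed data

dim(X)  # 5000 x 5: more latent sources (8) than observed variables (5)
```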
Unlike Podosinnikova et al., I use backpropagation to estimate the parameters. Based on ideas in Ding et al. (2019), I define a generative neural network for each of the k latent variables (a multi-layer perceptron) and a matrix A that mixes these variables into p observed pseudo variables. I train the model to approximate the generalized covariance matrices of the observed data. Unlike Ding et al., I explicitly penalize the loss to ensure the latent variables remain uncorrelated.
The key gain over other overcomplete ICA techniques is that we only use second-order statistics; no skewness- or kurtosis-related math is needed. This is great because higher-order moments like skewness and kurtosis, or higher-order cumulants, usually mean high-dimensional arrays and slow optimization.
Warning:
Inference is approximate! Here is an example of two runs estimating 8 latent variables given 5 observed variables, with the same data going into the model twice. Simply because we sample a random matrix of 96*5 points at which to evaluate the model ("t-values", see below), and the t-values in one of these runs were apparently less than ideal points of the multivariate density at which to evaluate the model, one run neatly recapitulates the ground truth while the other is noisier.
Good run | Bad run |
---|---|
In this case the better solution had a lower loss, which obviously is a good indicator, but this isn't always the case, as the loss can depend on the specific set of t-values that was sampled and we can overfit. In an empirical application you'd never know which run is closer to the truth, and without further exploration it would be hard to favor one over the other. It's essential to try a whole host of optimizer settings to figure out which gives you lower loss/better convergence. Once you are satisfied with the optimizer settings, you can try a few independent runs and compare results.
I have implemented a `num_runs` argument in `overica()` which lets you run the same model a number of times and keeps all runs; you can then use the best run (in terms of loss) or compute the median result across runs.
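For example, something along the following lines; the argument values are illustrative and the exact structure of the object returned when `num_runs > 1` is best checked in the package documentation (`?overica`).

```r
# Fit the same model several times with different random starts (illustrative
# sketch; see ?overica for the actual arguments and return structure)
results <- overica(
  data = data,
  k = k,
  num_runs = 5,       # keep all 5 runs
  n_batch = 4096,
  num_t_vals = 12,
  adam_epochs = 8000
)
# Then keep the run with the lowest loss, or compute the median estimate
# of the loadings across runs.
```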
Before proceeding, ensure that you have the necessary dependencies installed. You can install the `OverICA` package from your local directory or repository once it's built. Additionally, install the required packages if you haven't already:
# Install necessary packages
install.packages(c("torch", "clue", "MASS", "devtools"))
# Install the OverICA package from GitHub
devtools::install_github("MichelNivard/OverICA")
library(OverICA)
Here is a generic call to the `overica()` function. The data is an n by p matrix (n observations and p variables), and k is the number of latent components to extract. Be aware that this function fits a custom neural network (see the scientific background below); undertraining and relying on suboptimal or generic settings will give you bad results, but no warnings! You will always extract some results, but you won't know if they are bad!
# Call the estimation function
result <- overica(
data = data,
k = k,
n_batch = 4096,
num_t_vals = 12,
tbound = 0.2,
lambda = 0,
sigma = 3,
hidden_size = 10,
use_adam = TRUE,
use_lbfgs = FALSE,
adam_epochs = 8000,
adam_lr = 0.1,
lbfgs_epochs = 45,
lr_decay = 0.999
)
In statistical analysis and parameter estimation, particularly within the `OverICA` package, it's crucial to comprehend the relationship between the covariance matrix and the Cumulant Generating Function (CGF). This understanding underpins accurate parameter estimation and model fitting.
Cumulants are statistical measures that provide insights into the shape and structure of a probability distribution. Unlike moments (e.g., mean, variance), cumulants have properties that make them particularly useful for analyzing dependencies and higher-order interactions.
- First Cumulant ($\kappa_1$): Mean ($\mu$)
- Second Cumulant ($\kappa_2$): Variance ($\sigma^2$) for univariate distributions; Covariance ($\Sigma$) for multivariate distributions
- Higher-Order Cumulants: Related to skewness, kurtosis, etc.
The Cumulant Generating Function (CGF) encapsulates all cumulants of a random variable or vector. For a random vector $\mathbf{X} \in \mathbb{R}^p$, the CGF is defined as

$$K_{\mathbf{X}}(\mathbf{t}) = \log E\left[e^{\mathbf{t}^\top \mathbf{X}}\right]$$

where $\mathbf{t} \in \mathbb{R}^p$ is a vector of real-valued arguments.
- Cumulants via Derivatives: The cumulants are obtained by taking partial derivatives of the CGF evaluated at $\mathbf{t} = \mathbf{0}$.
- Additivity for Independence: If $\mathbf{X}$ and $\mathbf{Y}$ are independent, then $K_{\mathbf{X} + \mathbf{Y}}(\mathbf{t}) = K_{\mathbf{X}}(\mathbf{t}) + K_{\mathbf{Y}}(\mathbf{t})$.
The covariance matrix of the data is the matrix of second-order cumulants, and it can be obtained directly from the CGF. To see how, consider first a single random variable $X$.

The cumulant generating function $K_X(t)$ is defined as

$$K_X(t) = \log E\left[e^{tX}\right]$$

where $t$ is a real-valued argument.

The first derivative of the CGF, $K_X'(t)$, is

$$K_X'(t) = \frac{E\left[X e^{tX}\right]}{E\left[e^{tX}\right]}$$

where this is the mean of $X$ under the exponentially tilted distribution with weights proportional to $e^{tX}$.

The second derivative of $K_X(t)$ is

$$K_X''(t) = \frac{E\left[X^2 e^{tX}\right]}{E\left[e^{tX}\right]} - \left(\frac{E\left[X e^{tX}\right]}{E\left[e^{tX}\right]}\right)^2$$

This is equivalent to the variance of $X$ under the tilted distribution:

$$K_X''(t) = \operatorname{Var}_t(X)$$

At $t = 0$ there is the special case

$$K_X'(0) = E[X], \qquad K_X''(0) = \operatorname{Var}(X)$$
At the bottom of the page there is a full derivation of the relation between the CGF and the moments, which personally helped me greatly.
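As a quick numerical illustration (not package code), these identities can be checked on a simulated sample by differentiating the empirical CGF numerically:

```r
# Check numerically that K'(0) = E[X] and K''(0) = Var(X) for the empirical CGF
set.seed(42)
x <- rgamma(1e5, shape = 2, rate = 1) - 2    # a skewed, non-Gaussian variable

K <- function(t) log(mean(exp(t * x)))       # empirical cumulant generating function

h <- 1e-3
K_prime  <- (K(h) - K(-h)) / (2 * h)          # central difference ~ K'(0)
K_second <- (K(h) - 2 * K(0) + K(-h)) / h^2   # central difference ~ K''(0)

c(K_prime, mean(x))   # both close to the sample mean
c(K_second, var(x))   # both close to the sample variance
```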
The cumulant generating function (CGF) for a pair of random variables $X$ and $Y$ is an extension of the univariate case. It encapsulates information about the joint moments of $X$ and $Y$. Here's a breakdown of the bivariate CGF and its derivatives.

For two random variables $X$ and $Y$, the bivariate cumulant generating function $K_{X,Y}(t_1, t_2)$ is defined as

$$K_{X,Y}(t_1, t_2) = \log E\left[e^{t_1 X + t_2 Y}\right]$$

where:
- $t_1$ and $t_2$ are real-valued arguments,
- the expectation is taken over the joint distribution of $X$ and $Y$.

First Derivatives of $K_{X,Y}$:

The partial derivatives of $K_{X,Y}(t_1, t_2)$ evaluated at $\mathbf{t} = \mathbf{0}$ give the means:

$$\frac{\partial K_{X,Y}}{\partial t_1}\bigg|_{\mathbf{t}=\mathbf{0}} = E[X], \qquad \frac{\partial K_{X,Y}}{\partial t_2}\bigg|_{\mathbf{t}=\mathbf{0}} = E[Y]$$

where the expectations are taken under the joint distribution of $X$ and $Y$.

Second Derivatives of $K_{X,Y}$:

The second-order partial derivatives of $K_{X,Y}(t_1, t_2)$ evaluated at zero give the variances and the covariance.

Variance of $X$:

$$\frac{\partial^2 K_{X,Y}}{\partial t_1^2}\bigg|_{\mathbf{t}=\mathbf{0}} = \operatorname{Var}(X)$$

Variance of $Y$:

$$\frac{\partial^2 K_{X,Y}}{\partial t_2^2}\bigg|_{\mathbf{t}=\mathbf{0}} = \operatorname{Var}(Y)$$

Covariance between $X$ and $Y$:

$$\frac{\partial^2 K_{X,Y}}{\partial t_1 \partial t_2}\bigg|_{\mathbf{t}=\mathbf{0}} = \operatorname{Cov}(X, Y)$$

In other words, the second derivatives of the bivariate CGF $K_{X,Y}$ contain the full second-order structure of the joint distribution. When $\mathbf{t} = \mathbf{0}$ these derivatives reduce to the ordinary variances and covariance.

Thus, the covariance matrix is formed by the second-order partial derivatives of the CGF evaluated at zero:

$$\Sigma = \nabla^2 K_{X,Y}(\mathbf{t})\Big|_{\mathbf{t}=\mathbf{0}} = \begin{pmatrix} \operatorname{Var}(X) & \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(X, Y) & \operatorname{Var}(Y) \end{pmatrix}$$

Each element $\Sigma_{ij}$ is the mixed partial derivative $\partial^2 K / \partial t_i \partial t_j$ evaluated at $\mathbf{t} = \mathbf{0}$.
So we can define the covariances and means in terms of the cumulant generating function (CGF) evaluated at $\mathbf{t} = \mathbf{0}$, but nothing stops us from evaluating the CGF at other points $\mathbf{t}$. These other evaluations of the CGF can be viewed as generalized covariance matrices:

$$\operatorname{GenCov}(\mathbf{t}) = \nabla^2 K_{\mathbf{X}}(\mathbf{t})$$

Similarly, the generalized mean is the first derivative (gradient) of the CGF evaluated at a point $\mathbf{t}$ away from zero:

$$\operatorname{GenMean}(\mathbf{t}) = \nabla K_{\mathbf{X}}(\mathbf{t})$$
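To make this concrete, here is a minimal base-R sketch (my own illustration, not the package's internal implementation) of the empirical generalized mean and generalized covariance at a point $\mathbf{t}$. It uses the fact that the gradient and Hessian of the empirical CGF at $\mathbf{t}$ equal the mean and covariance of the data under weights proportional to $\exp(\mathbf{t}^\top \mathbf{x})$.

```r
# Empirical generalized mean and covariance of data X at a point tval:
# the gradient/Hessian of the empirical CGF at tval equal the mean/covariance
# of the data reweighted by exp(x' tval) (the "tilted" distribution)
gen_stats <- function(X, tval) {
  w <- exp(as.vector(X %*% tval))   # tilting weights, one per observation
  w <- w / sum(w)                   # normalize to sum to one
  gmean <- colSums(w * X)           # generalized mean  = gradient of the CGF
  Xc <- sweep(X, 2, gmean)          # center at the generalized mean
  gcov <- crossprod(Xc * w, Xc)     # generalized covariance = Hessian of the CGF
  list(mean = gmean, cov = gcov)
}

# Sanity check: at tval = 0 the weights are uniform, so we recover the
# ordinary mean and (maximum-likelihood) covariance
set.seed(1)
X <- MASS::mvrnorm(1e4, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
s0 <- gen_stats(X, c(0, 0))
all.equal(s0$mean, colMeans(X))                       # TRUE
max(abs(s0$cov - cov(X) * (nrow(X) - 1) / nrow(X)))   # ~ 0 (1/n vs 1/(n-1))
```

At other values of $\mathbf{t}$ the same function returns the generalized statistics that the package matches.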
In `OverICA` we evaluate the empirical CGF of the data at a number of points ($\mathbf{t}$-values) and train the model so that its implied generalized covariances and generalized means at those same points match the empirical ones.
We base our inference on Ding et al. (2019). A generative adversarial network is a neural network that "learns" to approximate a data distribution. In our case the network will learn to mimic the generalized covariances of the actual data. The model is depicted in the diagram below:
We start with a sample of random normally distributed data vectors, which are passed through a small neural network for each latent variable; the resulting $k$ latent variables are mixed by the matrix $A$ into $p$ pseudo observed variables $\hat{y}$. The generalized covariances and generalized means of the pseudo data are then computed and compared to those of the observed data.
So in our model the free parameters are:

- a neural network for each latent variable
- the loadings matrix $A$
Under the following constraints:

a. the latent variables remain uncorrelated
b. optionally, a sparsity constraint on $A$
Estimation proceeds by minimizing a loss that sums the following terms over a number of values $\mathbf{t}$:

$$||GenCov(\hat{y}) - GenCov(y)||_2$$

$$||GenMean(\hat{y}) - GenMean(y)||_2$$
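As an illustration of this objective (again a sketch, not the package's internal code), the loss for one draw of t-values could be computed with the `gen_stats()` helper defined above:

```r
# Sketch of the objective: L2 distances between empirical and model-implied
# generalized covariances and means, summed over a set of sampled t-values
loss_at_tvals <- function(Y_obs, Y_pseudo, t_vals) {   # t_vals: p x num_t matrix
  total <- 0
  for (i in seq_len(ncol(t_vals))) {
    emp <- gen_stats(Y_obs,    t_vals[, i])   # empirical generalized stats
    mod <- gen_stats(Y_pseudo, t_vals[, i])   # model-implied (pseudo data) stats
    total <- total +
      sqrt(sum((mod$cov  - emp$cov )^2)) +    # ||GenCov(y_hat) - GenCov(y)||_2
      sqrt(sum((mod$mean - emp$mean)^2))      # ||GenMean(y_hat) - GenMean(y)||_2
  }
  total
}
```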
We aren't interested in learning specific generalized covariances of the data, so for each iteration of the optimizer we sample fresh values $t$ and evaluate the empirical and model-implied generalized covariances. This ensures we don't accidentally overfit to a specific, poorly estimated set of empirical generalized covariances of the data.
- Sample a fresh batch of random normal input data
- Sample new values $t$
- Compute the generalized covariances for the data at $t$
- Forward Pass (Prediction):
- Input Data: Feed the input data into the model.
- Compute Output: The model processes the input through its layers to produce an output (prediction) of the generalized covariances at $t$.
- Loss Computation:
- Calculate Loss: Compute the difference between the predicted outputs and the actual outputs.
- Backward Pass (Backpropagation):
- Compute Gradients: Calculate how much each parameter (weight and bias) contributes to the loss using calculus (derivatives).
- Gradient Descent: Use these gradients to determine the direction to adjust the parameters to minimize the loss.
- Parameter Update:
- Adjust Parameters: Update the model's parameters slightly in the direction that reduces the loss.
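Putting these steps together, here is a condensed sketch in R `torch` of what such a training loop can look like. It is written under my own simplifying assumptions (it omits the decorrelation and sparsity penalties, learning-rate decay, and the optional L-BFGS refinement) and is not the actual `overica()` implementation.

```r
library(torch)

p <- 5; k <- 8; hidden <- 10; n_batch <- 4096
data <- matrix(rnorm(2000 * p), 2000, p)   # stand-in for your observed n x p data

# Generator: one small MLP per latent variable plus a k x p mixing matrix A
generator <- nn_module(
  "generator",
  initialize = function(k, p, hidden) {
    self$k <- k
    self$nets <- nn_module_list(lapply(seq_len(k), function(j) {
      nn_sequential(nn_linear(1, hidden), nn_relu(), nn_linear(hidden, 1))
    }))
    self$A <- nn_parameter(torch_randn(k, p) * 0.1)
  },
  forward = function(z) {
    s <- torch_cat(lapply(seq_len(self$k), function(j) {
      self$nets[[j]](z[, j]$unsqueeze(2))   # latent variable j from its Gaussian input
    }), dim = 2)
    torch_matmul(s, self$A)                 # n_batch x p pseudo observed data
  }
)

# Generalized mean (gradient of the CGF) and covariance (Hessian) at point tval
gen_stats_torch <- function(Y, tval) {
  w <- torch_exp(torch_matmul(Y, tval$unsqueeze(2)))   # n x 1 tilting weights
  w <- w / torch_sum(w)
  m <- torch_matmul(torch_t(w), Y)                     # 1 x p generalized mean
  Yc <- Y - m
  list(mean = m, cov = torch_matmul(torch_t(Yc * w), Yc))
}

gen    <- generator(k, p, hidden)
opt    <- optim_adam(gen$parameters, lr = 0.1)
Y_data <- torch_tensor(data, dtype = torch_float())

for (epoch in seq_len(200)) {
  opt$zero_grad()
  z    <- torch_randn(n_batch, k)            # fresh Gaussian inputs
  yhat <- gen(z)                             # forward pass: pseudo data
  loss <- torch_zeros(1)
  for (i in seq_len(12)) {                   # fresh t-values every iteration
    tval <- torch_randn(p) * 0.2             # analogous to tbound
    emp  <- gen_stats_torch(Y_data, tval)
    mod  <- gen_stats_torch(yhat, tval)
    loss <- loss + torch_sum((mod$cov - emp$cov)^2) +    # squared norms for brevity
                   torch_sum((mod$mean - emp$mean)^2)
  }
  loss$backward()                            # backpropagation
  opt$step()                                 # parameter update
}
```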
- [Likelihood-free Overcomplete ICA](https://github.com/dingchenwei/Likelihood-free_OICA)
- [Overcomplete ICA through convex optimisation](https://github.com/gilgarmish/oica)
Podosinnikova, A., Perry, A., Wein, A. S., Bach, F., d’Aspremont, A., & Sontag, D. (2019, April). Overcomplete independent component analysis via SDP. In The 22nd international conference on artificial intelligence and statistics (pp. 2583-2592). PMLR.
Ding, C., Gong, M., Zhang, K., & Tao, D. (2019). Likelihood-free overcomplete ICA and applications in causal discovery. Advances in neural information processing systems, 32.
Let's go through the derivation step by step to understand how we get to the second derivative of the cumulant generating function $K(t)$.

The cumulant generating function $K(t)$ is defined as

$$K(t) = \log M(t) = \log E\left[e^{tX}\right]$$

where $M(t) = E\left[e^{tX}\right]$ is the moment generating function of $X$.

To find $K'(t)$, differentiate $K(t) = \log M(t)$ with respect to $t$. Using the chain rule, we get:

$$K'(t) = \frac{M'(t)}{M(t)}$$

Now, differentiate $M(t) = E\left[e^{tX}\right]$ with respect to $t$. Since differentiation and expectation can be interchanged:

$$M'(t) = E\left[X e^{tX}\right]$$

Substitute the derivative back into the expression for $K'(t)$:

$$K'(t) = \frac{E\left[X e^{tX}\right]}{E\left[e^{tX}\right]}$$

This is the expected value of $X$ under the exponentially tilted distribution with weights proportional to $e^{tX}$.

Now, differentiate $K'(t)$ to find $K''(t)$. Applying the quotient rule to $M'(t)/M(t)$:

$$K''(t) = \frac{M''(t)M(t) - M'(t)^2}{M(t)^2}$$

Use the definition of $M''(t) = E\left[X^2 e^{tX}\right]$. This can be rewritten as:

$$K''(t) = \frac{E\left[X^2 e^{tX}\right]}{E\left[e^{tX}\right]} - \left(\frac{E\left[X e^{tX}\right]}{E\left[e^{tX}\right]}\right)^2$$

This is because:

$$E_t\left[X^2\right] = \frac{E\left[X^2 e^{tX}\right]}{E\left[e^{tX}\right]}$$

and

$$E_t\left[X\right] = \frac{E\left[X e^{tX}\right]}{E\left[e^{tX}\right]}$$

The expression $E_t\left[X^2\right] - \left(E_t\left[X\right]\right)^2$ is the variance of $X$ under the tilted distribution.

Thus, the second derivative of the cumulant generating function is

$$K''(t) = \operatorname{Var}_t(X)$$

the variance of $X$ under the exponentially tilted distribution, which at $t = 0$ reduces to the ordinary variance of $X$.
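As a final numerical check (an illustration only, not package code), the tilted-distribution interpretation of $K'(t)$ and $K''(t)$ can be verified at a point $t \neq 0$:

```r
# Verify numerically that K'(t) and K''(t) equal the mean and variance of X
# under the distribution reweighted (tilted) by exp(t * x)
set.seed(7)
x  <- rexp(1e5) - 1                      # a skewed variable with mean 0
t0 <- 0.3                                # evaluate away from zero

K  <- function(t) log(mean(exp(t * x)))  # empirical CGF
h  <- 1e-3
K1 <- (K(t0 + h) - K(t0 - h)) / (2 * h)            # numerical K'(t0)
K2 <- (K(t0 + h) - 2 * K(t0) + K(t0 - h)) / h^2    # numerical K''(t0)

w   <- exp(t0 * x); w <- w / sum(w)      # tilting weights
m_t <- sum(w * x)                        # tilted mean
v_t <- sum(w * (x - m_t)^2)              # tilted variance

c(K1, m_t)   # should agree (generalized mean at t0)
c(K2, v_t)   # should agree (generalized variance at t0)
```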