`featurize` huge datasets #396

frostedoyster · 2024-12-24T10:23:12Z

At the moment, PCA appears to require the full n_structures x n_features feature matrix as an argument. For very large datasets, this matrix will not fit in memory.
It might be worth designing a custom PCA class that accumulates an n_features x n_features covariance matrix instead. That matrix stays manageable in size and can be diagonalized once all structures have been processed. This would make it possible to explore huge datasets even on an ordinary laptop, possibly with batched evaluation (and a few hours of runtime).
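
To make the idea concrete, here is a minimal sketch of what such an accumulator could look like. It is not existing code from this project: the class name `CovariancePCA` and its methods `partial_fit`, `finalize`, and `transform` are hypothetical, and it assumes each batch arrives as a dense NumPy array of shape (n_structures_in_batch, n_features).

```python
import numpy as np


class CovariancePCA:
    """Hypothetical streaming PCA that only stores n_features x n_features state."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.n_samples = 0
        self.sum_x = None    # running sum of feature vectors, shape (n_features,)
        self.sum_xtx = None  # running sum of X^T X, shape (n_features, n_features)

    def partial_fit(self, batch):
        """Accumulate statistics from one batch of feature vectors."""
        batch = np.asarray(batch, dtype=np.float64)
        if self.sum_x is None:
            n_features = batch.shape[1]
            self.sum_x = np.zeros(n_features)
            self.sum_xtx = np.zeros((n_features, n_features))
        self.n_samples += batch.shape[0]
        self.sum_x += batch.sum(axis=0)
        self.sum_xtx += batch.T @ batch
        return self

    def finalize(self):
        """Build the covariance matrix and diagonalize it once."""
        n = self.n_samples
        mean = self.sum_x / n
        # Unbiased covariance: (sum of x x^T - n * mean mean^T) / (n - 1)
        cov = (self.sum_xtx - n * np.outer(mean, mean)) / (n - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][: self.n_components]
        self.mean_ = mean
        self.explained_variance_ = eigvals[order]
        self.components_ = eigvecs[:, order].T  # shape (n_components, n_features)
        return self

    def transform(self, batch):
        """Project a batch onto the leading principal components."""
        batch = np.asarray(batch, dtype=np.float64)
        return (batch - self.mean_) @ self.components_.T
```

With this design the dataset is read twice: once to call `partial_fit` on every batch, then, after `finalize`, once more to call `transform` batch by batch. One caveat: subtracting `n * mean mean^T` from the raw sum can lose precision when the feature means are large compared with the spread, so a more careful version would probably update a centered covariance incrementally.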
