featurize huge datasets #396

Open
frostedoyster opened this issue Dec 24, 2024 · 0 comments

At the moment, it seems that PCA requires the full n_structures x n_features feature matrix as an argument. For very large datasets, this matrix will not fit in memory.
Perhaps it would be beneficial to design a custom PCA class that accumulates an n_features x n_features covariance matrix instead. That matrix stays manageable regardless of the number of structures, and it can be diagonalized once all structures have been processed. This way, exploring potentially huge datasets should become possible even on ordinary laptops, potentially taking advantage of batched evaluation (and a few hours of runtime).
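
To make the idea concrete, here is a minimal sketch of what such a class could look like, using only NumPy. The class name `StreamingPCA` and its methods (`partial_fit`, `finalize`, `transform`) are hypothetical and not part of any existing API in this project. The sketch accumulates the sum of the features and the sum of their outer products batch by batch, so memory use scales with n_features x n_features rather than with n_structures.

```python
import numpy as np


class StreamingPCA:
    """Hypothetical helper: PCA from first and second moments accumulated per batch."""

    def __init__(self, n_features, n_components):
        self.n_components = n_components
        self.n_samples = 0
        # Running sums; their size depends only on n_features.
        self.sum_x = np.zeros(n_features)
        self.sum_xxT = np.zeros((n_features, n_features))

    def partial_fit(self, batch):
        """Add one (batch_size, n_features) block of features to the running sums."""
        batch = np.asarray(batch, dtype=np.float64)
        self.n_samples += batch.shape[0]
        self.sum_x += batch.sum(axis=0)
        self.sum_xxT += batch.T @ batch
        return self

    def finalize(self):
        """Build the covariance matrix and diagonalize it once all batches are seen."""
        mean = self.sum_x / self.n_samples
        cov = (self.sum_xxT - self.n_samples * np.outer(mean, mean)) / (self.n_samples - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][: self.n_components]
        self.mean_ = mean
        self.explained_variance_ = eigvals[order]
        self.components_ = eigvecs[:, order].T
        return self

    def transform(self, batch):
        """Project a batch onto the leading principal components."""
        return (np.asarray(batch, dtype=np.float64) - self.mean_) @ self.components_.T


# Usage with synthetic data standing in for featurized structures.
rng = np.random.default_rng(0)
pca = StreamingPCA(n_features=8, n_components=2)
batches = [rng.normal(size=(100, 8)) for _ in range(5)]
for batch in batches:  # first pass: accumulate
    pca.partial_fit(batch)
pca.finalize()
projected = [pca.transform(batch) for batch in batches]  # second pass: project
```

Note that this requires two passes over the data (one to accumulate, one to project), and that the sum-of-products formula can lose precision when feature means are large compared to their variances. A production version might prefer a numerically stabler update or an existing incremental implementation such as scikit-learn's `IncrementalPCA`.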
