-
Hi @sachinruk, we apply PCA directly to the static token embeddings that you get by forward passing the vocabulary. The relevant line can be found here. So essentially, we first forward pass all the tokens, which gives you a (vocab_size, dim_size) embedding matrix (where dim_size is the dimensionality of the model being distilled), and then we apply PCA to those embeddings, which yields (vocab_size, pca_dims) output embeddings. Hope that answers your question!
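For illustration, here is a minimal NumPy sketch of that PCA step, with random embeddings standing in for the real forward-pass outputs (the shapes and sizes below are hypothetical, not taken from the actual code):

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, pca_dims: int) -> np.ndarray:
    """Reduce a (vocab_size, dim_size) embedding matrix to (vocab_size, pca_dims) via PCA."""
    # Center the embeddings; the principal axes are the right singular vectors of the centered matrix.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the top pca_dims principal components.
    return centered @ vt[:pca_dims].T

# Stand-in for the static token embeddings obtained by forward passing the vocabulary.
vocab_size, dim_size, pca_dims = 1000, 768, 256
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(vocab_size, dim_size))

reduced = pca_reduce(token_embeddings, pca_dims)
print(reduced.shape)  # (1000, 256)
```

In practice you would replace `token_embeddings` with the matrix produced by the forward pass, so the output embeddings have the reduced dimensionality regardless of the original model's hidden size.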
-
Hello, I've seen that using PCA is one of the biggest advantages. I'm just wondering where exactly PCA is applied.
I initially thought you were applying PCA over the output embeddings of a decently large dataset, but that doesn't seem to be the case (since you can apply this to arbitrary models). The only other place I can think of is the token embeddings themselves, but then the model dimensions mismatch.
TIA