improve SMART parameters standardization #708
❓ I have no idea what you're talking about 😆 I could look at the code, but am too lazy... Can you elaborate a bit more, maybe with an example?
Long story short, there are still some cases where the SMART method does not perform well compared to "refit" and also to "classic". This happens mainly for interaction terms, and in particular for interactions between a continuous and a factor variable. I have no idea how to improve on that; to address it we would need someone with a deep understanding of how model matrices are built for interaction terms, and of how to standardize the interaction column so that it reflects the interaction of two standardized variables... That's more of a long-term issue though, in case we ever meet someone familiar with model matrices and formulas...
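To make the gap concrete, here is a minimal comparison sketch along the lines being discussed. It is hedged: the method = "smart" and method = "classic" arguments are assumed to be accepted by parameters_standardize(), which is only shown with method = "refit" further down in this thread.

```r
# Hypothetical comparison (assumes parameters_standardize() accepts the
# method names discussed in this thread; only "refit" appears below).
# A continuous-by-factor interaction, the case where "smart" reportedly
# departs from "refit".
model <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris)

refit   <- parameters::parameters_standardize(model, method = "refit")
smart   <- parameters::parameters_standardize(model, method = "smart")
classic <- parameters::parameters_standardize(model, method = "classic")

# The [, 2] indexing mirrors the call used later in this thread.
cbind(Refit = refit[, 2], Smart = smart[, 2], Classic = classic[, 2])
```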
In a nutshell, the goal here is to reconstruct the standardized model.matrix (the second one below) from the original model matrix and the mean/SD of each variable...

```r
df <- iris
dfZ <- parameters::standardize(iris)
head(model.matrix(~ df$Sepal.Length * df$Species))
#> (Intercept) df$Sepal.Length df$Speciesversicolor df$Speciesvirginica
#> 1 1 5.1 0 0
#> 2 1 4.9 0 0
#> 3 1 4.7 0 0
#> 4 1 4.6 0 0
#> 5 1 5.0 0 0
#> 6 1 5.4 0 0
#> df$Sepal.Length:df$Speciesversicolor df$Sepal.Length:df$Speciesvirginica
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
#> 5 0 0
#> 6 0 0
head(model.matrix(~ dfZ$Sepal.Length * dfZ$Species))
#> (Intercept) dfZ$Sepal.Length dfZ$Speciesversicolor dfZ$Speciesvirginica
#> 1 1 -0.8976739 0 0
#> 2 1 -1.1392005 0 0
#> 3 1 -1.3807271 0 0
#> 4 1 -1.5014904 0 0
#> 5 1 -1.0184372 0 0
#> 6 1 -0.5353840 0 0
#> dfZ$Sepal.Length:dfZ$Speciesversicolor
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
#> dfZ$Sepal.Length:dfZ$Speciesvirginica
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
```

Created on 2019-09-23 by the reprex package (v0.3.0)
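One way to read "reconstruct the standardized model matrix" for this continuous-by-factor case: the factor dummies are unchanged by standardization (as the output above shows), so only the continuous part of each interaction column needs rescaling. A rough sketch, assuming standardize() scales numeric columns as (x - mean) / SD (which matches the -0.8976739 printed above):

```r
# Rebuild the standardized "Sepal.Length:Speciesversicolor" column from the
# raw model matrix plus the mean/SD of the continuous variable alone.
m <- mean(iris$Sepal.Length)
s <- sd(iris$Sepal.Length)

mm_raw  <- model.matrix(~ Sepal.Length * Species, data = iris)
rebuilt <- (mm_raw[, "Sepal.Length:Speciesversicolor"] -
              m * mm_raw[, "Speciesversicolor"]) / s

dfZ    <- parameters::standardize(iris)
mm_std <- model.matrix(~ Sepal.Length * Species, data = dfZ)

all.equal(unname(rebuilt),
          unname(mm_std[, "Sepal.Length:Speciesversicolor"]))  # expected: TRUE
```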
I believe one of the reasons for the issues with interactions stems from the fact that, as we know, a regression model "fixes" the other parameters at 0, which corresponds to the mean when a standardized dataset is passed. Say we have the interaction x * y. The coefficient for "x" is the coefficient of x when y = 0, and it changes as y changes (following the interaction coefficient). Now, if a standardized dataset is passed, it is normal that the "x" parameter is different, as it now corresponds to the effect of "x" at the mean of "y" (which is not necessarily what 0 means in the unstandardized data). Hence, my hunch is that the posthoc standardization should somehow take the mean of the variables into account in the case of interactions. I have no idea how, though. To facilitate the exploration, I've refactored parameters_standardize and created standardize_info:

```r
model <- lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = iris)
info <- parameters::standardize_info(model)
info$Refit <- parameters::parameters_standardize(model, method = "refit")[, 2]
info$Raw <- insight::get_parameters(model)[, 2]
info[sapply(info, is.numeric)] <- sapply(info[sapply(info, is.numeric)], round, digits = 1)
info
#> Parameter Type Factor Deviation_Response
#> 1 (Intercept) intercept <NA> 0.4
#> 2 Petal.Width numeric FALSE 0.4
#> 3 Sepal.Length numeric FALSE 0.4
#> 4 Petal.Width:Sepal.Length interaction FALSE 0.4
#> Mean_Response Deviation_Classic Mean_Classic Deviation_Smart Mean_Smart
#> 1 3.1 0.0 0.0 0.0 0.0
#> 2 3.1 0.8 1.2 0.8 1.2
#> 3 3.1 0.8 5.8 0.8 5.8
#> 4 3.1 5.3 7.5 0.8 5.8
#> Refit Raw
#> 1 -0.2 3.4
#> 2 -0.7 -1.5
#> 3 0.4 0.0
#> 4 0.3 0.2
```

Created on 2019-10-08 by the reprex package (v0.3.0)

In a nutshell, the problem is to try to FIND the "Refit" column from the "Raw" column using the remaining information...
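For this particular model (a numeric-by-numeric interaction, no transformations), the "Refit" column can in fact be recovered in closed form from the raw coefficients plus the means and SDs, because standardization is an affine change of variables. A sketch of that recovery, checked against an actual refit on scaled data using base R only; this illustrates the relationship and is not what the package currently implements:

```r
# For y = b0 + b1*x1 + b2*x2 + b3*x1*x2, refitting on standardized data gives
#   b1_std = sd(x1) * (b1 + b3 * mean(x2)) / sd(y)
#   b2_std = sd(x2) * (b2 + b3 * mean(x1)) / sd(y)
#   b3_std = sd(x1) * sd(x2) * b3 / sd(y)
model <- lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = iris)
b <- coef(model)

m1 <- mean(iris$Petal.Width);  s1 <- sd(iris$Petal.Width)
m2 <- mean(iris$Sepal.Length); s2 <- sd(iris$Sepal.Length)
sy <- sd(iris$Sepal.Width)

posthoc <- c(
  s1 * (b["Petal.Width"]  + b["Petal.Width:Sepal.Length"] * m2) / sy,
  s2 * (b["Sepal.Length"] + b["Petal.Width:Sepal.Length"] * m1) / sy,
  s1 * s2 * b["Petal.Width:Sepal.Length"] / sy
)

# Reference: refit the same model on scaled data.
iris_z <- as.data.frame(scale(iris[sapply(iris, is.numeric)]))
refit  <- coef(lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = iris_z))[-1]

all.equal(unname(posthoc), unname(refit))  # expected: TRUE
```

The hard part, as the comments above suggest, is generalizing this once factor contrasts, formula transformations, or higher-order interactions shape the model matrix.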
At the same time, it suggests that the issues with interactions are not real issues: the estimates simply correspond to something different, but they are not wrong per se (I think).
I reckon that's the reason for partial standardized coefficients (https://www.jstor.org/stable/2684719), but we would need the VIF for that.
@DominiqueMakowski Can you explain what exactly the "smart" method is trying to do? It seems to break somewhat when there are formula transformations. (In cases with transformations it will never be equal to method "refit", because the parameters themselves are estimated differently.) (As you mention above, this is also the issue with interactions: the centering changes the simple/conditional slope parameters, so it also cannot be the same as with "refit", no?)

So broadly asking, what would method "smart" do to this model: log(y) ~ sqrt(x) * some_factor

Also, what exactly is the conceptual difference between methods "smart" and "posthoc"?
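To make the log(y) ~ sqrt(x) * some_factor question concrete, a hypothetical illustration (not from the thread) of why transformations inside the formula leave a posthoc method without an obvious scale to rescale by:

```r
# Which SD should a posthoc method use for the sqrt(x) coefficient:
# sd(x) or sd(sqrt(x))? The two choices give different answers.
model <- lm(log(Sepal.Width) ~ sqrt(Sepal.Length) * Species, data = iris)
b <- coef(model)["sqrt(Sepal.Length)"]

sd_y <- sd(log(iris$Sepal.Width))            # SD of the transformed response
b * sd(iris$Sepal.Length)       / sd_y       # rescaled by the SD of raw x
b * sd(sqrt(iris$Sepal.Length)) / sd_y       # rescaled by the SD of sqrt(x)
```

Refitting on standardized raw variables is also problematic here, since sqrt() and log() of centered (partly negative) values produce NaNs.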
TL;DR: no idea.

Longer story: in simpler models, basic and refit are equivalent. But for more complex models (especially with interactions, transformations, etc.), basic starts to depart from refit. IMO, refit gives the "gold standard" results, because it doesn't involve any posthoc transformation. However, the problem is that "refit" is computationally heavy (especially for Bayesian models). So the goal of "smart" is to be a posthoc method (one that does not refit the model from scratch but simply transforms the parameters with the information it has) that nevertheless gives the same results as "refit". In simple cases there should be no difference between the three methods basic, refit and smart; in more complex cases, smart aims at giving the same results as "refit". Basically, "smart" is supposed to be "refit" minus the model refitting. So to get back to your initial question of what the result should be in that particular case: the expected result should be the same as the one given by "refit"... As for how it works, basically...
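For reference, a minimal sketch of what "a posthoc method that simply transforms the parameters" means in the basic case (an illustration, not the package's implementation): each slope is rescaled by SD(predictor) / SD(response), which is exact for additive models and is precisely what stops working once interactions or transformations enter the formula.

```r
# Basic posthoc standardization for an additive model: rescale each slope
# by SD(predictor) / SD(response). No refit needed.
model <- lm(Sepal.Width ~ Petal.Width + Sepal.Length, data = iris)

sds   <- sapply(iris[c("Petal.Width", "Sepal.Length")], sd)
basic <- coef(model)[-1] * sds / sd(iris$Sepal.Width)

# For this additive model it matches a full refit on scaled data exactly.
iris_z <- as.data.frame(scale(iris[sapply(iris, is.numeric)]))
refit  <- coef(lm(Sepal.Width ~ Petal.Width + Sepal.Length, data = iris_z))[-1]

all.equal(unname(basic), unname(refit))  # expected: TRUE
```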
But we can move slowly here; it doesn't need to be perfect from scratch. Smart can always fall back to being equivalent to "basic" when we don't know how to retrieve the "refit"-like coefficients. Basically it's a "basic+" method, i.e., it's fast, will work in most cases, and in some cases will just give you the straightforward "basic" standardization.
IMO this is a critical issue: being able to retrieve 'refit' standardized parameters with a posthoc method.

Improvements are possible, especially for the case of interactions, but a more robust and systematic testing framework might be needed. Also, knowledge of model matrices when factors are involved appears to be key.