-
Hello, I am trying to combine the KML clustering method in latrend with SillyPutty (see the section "Combining SillyPutty With Hierarchical Clustering"). Specifically, I am trying to do the following:

"To apply SillyPutty to an already precomputed clustering algorithm, you have to have the cluster identities of the clustering algorithm and the distance matrix of the data set. SillyPutty will then recalculate the clusters from a starting point within the post-clustered clusters and return the best silhouette width score and the new cluster identities."

I am having trouble finding 1) the cluster identities of the clustering algorithm and 2) the distance matrix. Could you explain how I might find these two objects using the lcMethodKML() function? I understand that I can fit the KML model as follows:

# Select 4-cluster model as preferred representation

I would like to use the 4-cluster KML model and enhance it with SillyPutty, but I am not sure how to extract the cluster identities and the (Euclidean) distance matrix such that I can run the following as in their example:

I believe the cluster assignments can be extracted using […]

Would you be able to help me with this? Thank you!
Replies: 7 comments
-
Hi @hichew22, thanks for the detailed post.

The cluster assignment for the trajectories can be converted to integer by using `as.integer(cluster)`, where 1 = first cluster, 2 = second cluster, etc.

KmL is not a hierarchical cluster algorithm, so it does not compute or consider the pairwise distances between trajectories. That said, since $k$-means uses the Euclidean distance, we can compute a distance matrix ourselves.

First, the data needs to be structured with one trajectory per row, and each column representing a different time point. You can use latrend's `tsmatrix` function for that. Then we can compute the Euclidean distance matrix using R's `dist` function.

```r
tsdata = tsmatrix(df, response = 'value')
d = dist(tsdata)
```

If SillyPutty expects […]

It's important that the cluster assignments vector and the distance matrix rows have the same order (i.e., refer to the same trajectories). With the code I posted this is the case. The order of the assignments can be obtained using […]
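The steps above can be sketched end to end with base R only. The toy data, the column names `id`, `day` and `value`, and the use of `stats::kmeans` as a stand-in for the fitted KmL model are all assumptions made for illustration. With latrend, the wide matrix would come from `tsmatrix()` and the assignments from the fitted model instead.

```r
# Toy long-format data: 6 trajectories observed at 4 time points (hypothetical).
set.seed(1)
df <- expand.grid(day = 1:4, id = paste0("T", 1:6))
df$value <- rnorm(nrow(df), mean = ifelse(df$id %in% c("T1", "T2", "T3"), 0, 5))

# One trajectory per row, one column per time point
# (a base-R stand-in for the reshaping step described above).
tsdata <- tapply(df$value, list(df$id, df$day), mean)

# Euclidean distance matrix between trajectories.
d <- dist(tsdata, method = "euclidean")

# Stand-in clustering (NOT KmL): k-means on the wide matrix, stored as a
# factor to mimic a vector of cluster labels.
cluster <- factor(kmeans(tsdata, centers = 2)$cluster, labels = c("A", "B"))
labels_int <- as.integer(cluster)  # 1 = first cluster, 2 = second cluster

# Both objects must refer to the same trajectories in the same order.
names(labels_int) <- rownames(tsdata)
stopifnot(identical(names(labels_int), attr(d, "Labels")))
```

The integer labels and the `dist` object are the two inputs the SillyPutty documentation quoted in the question asks for. The exact argument names and accepted types should be checked against the SillyPutty package documentation.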
-
Hi Niek, thank you very much for your help! I think I was able to figure out how to do this. If I have a dataframe containing the new cluster assignments, is there a way to plot the assigned cluster trajectories? I can join the cluster assignments with my longitudinal dataframe and then use ggplot like so:

However, I would like to include the colored lines for the cluster trajectories as in […]
-
You're welcome! You can use the `plotClusterTrajectories()` function:

```r
plotClusterTrajectories(df, cluster = 'newcluster', trajectories = TRUE, facet = TRUE)
```

Have a look at the function's documentation for more options.

Alternatively, if you want to overlay the newly assigned trajectories with the original cluster trajectories computed by KmL, we'll need to combine the two manually by plotting the trajectories and then drawing the cluster trajectories over them:

```r
library(ggplot2)  # provides geom_line() and aes()

# Extract the cluster trajectories as a data.frame
df_cluster = clusterTrajectories(kml_model_4)
# We need matching names between df_cluster and df_traj for the facetting set by plotTrajectories()
df$Cluster = df_traj$new_cluster
plotTrajectories(df, response = 'value', cluster = 'Cluster', facet = TRUE) +
  geom_line(data = df_cluster, aes(x = day, y = value, color = Cluster))
```

In case of errors, check whether the time, id, response and cluster arguments are correctly specified. Also, the names of the clusters need to be the same between the data frames.
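If the new assignments are stored with one row per trajectory rather than one row per observation (an assumption about the data layout, not something stated above), they have to be looked up by trajectory id before they can be added as a column to the long-format data. A minimal base-R sketch, with hypothetical data frames and column names:

```r
# Long-format data with several rows per trajectory (hypothetical).
df <- data.frame(
  id = rep(c("a", "b", "c"), each = 3),
  day = rep(1:3, times = 3),
  value = c(1, 2, 3, 2, 3, 4, 9, 8, 7)
)

# One row per trajectory with the re-assigned cluster (hypothetical),
# deliberately in a different order than the ids appear in df.
df_traj <- data.frame(id = c("c", "a", "b"), new_cluster = c("B", "A", "A"))

# Look up each observation's trajectory id to get its cluster.
df$Cluster <- factor(df_traj$new_cluster[match(df$id, df_traj$id)])

# Every observation should have received a cluster.
stopifnot(!anyNA(df$Cluster))
```

The factor levels of `df$Cluster` then need to match the cluster names in the cluster trajectories data frame for the overlay to line up per facet.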
-
Hi Niek, thank you for your help! Those are what I would like to plot. I need a little more guidance in setting up the dataframes correctly for the above code. First, I started with df_lab, which contains the longitudinal laboratory values for each individual (multiple timepoints per individual). This dataframe is what I used to fit the kml_model_4 on. I extracted the cluster assignments and ids as follows:
Then, I used your guidance to 1) restructure the longitudinal data with one trajectory per row, and each column representing a different time point using latrend's tsmatrix function and 2) compute the Euclidean distance using R's dist function as follows:
Lastly, I used the SillyPutty function to combine SillyPutty with the kml clustering and created a dataframe combining the ids, kml clusters, and SillyPutty clusters:
Could you explain how these dataframes and clusters would fit into the plotClusterTrajectories and plotTrajectories examples you provided? Do I need to join df_lab and df_cluster_combine? Thank you!
-
There's currently no option for the […] I will create an issue because I think it would be a nice feature to have. For now, you can achieve it by:

```r
# Proportion of trajectories per new cluster
props = prop.table(table(df_new$new_cluster))
# Labels of the form "<cluster> (<percentage>%)"
cluster_labels = sprintf('%s (%d%%)', names(props), round(props * 100))
df_new$new_cluster_label = factor(df_new$new_cluster, levels = names(props), labels = cluster_labels)
plotClusterTrajectories(
  df_new,
  response = "lab",
  cluster = "new_cluster_label",
  trajectories = TRUE,
  facet = TRUE
)
```
-
Awesome, thank you so much for all your help!!