
Cluster Methods

ConnorChato edited this page Feb 5, 2019 · 2 revisions

Cluster Techniques

Component Clustering

See https://github.com/veg/hivtrace

Clusters are formed by removing edges from a completely connected graph. Filtering produces a subgraph with the same vertices, connected only by those edges whose distance falls below a specified cutoff threshold; each connected component of that subgraph is a cluster.
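The filtering step can be sketched in a few lines of Python. This is a minimal illustration and not the hivtrace implementation: the function name component_clusters, the dictionary-of-pairs input format, and the example distances are all assumptions made for the sketch.

```python
from collections import defaultdict

def component_clusters(vertices, distances, cutoff):
    """Group vertices into connected components, keeping only edges below cutoff.

    `distances` maps a pair (u, v) to a pairwise distance. This input format
    is hypothetical and chosen only for illustration.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (u, v), d in distances.items():
        if d < cutoff:  # edges at or above the cutoff are removed
            parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for v in vertices:
        clusters[find(v)].add(v)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

# Made-up pairwise distances between four cases.
vertices = ["A", "B", "C", "D"]
distances = {
    ("A", "B"): 0.010, ("A", "C"): 0.040, ("A", "D"): 0.050,
    ("B", "C"): 0.030, ("B", "D"): 0.045, ("C", "D"): 0.012,
}
clusters = component_clusters(vertices, distances, cutoff=0.015)
print(clusters)
```

With these made-up distances only the A-B and C-D edges survive the filter, so the four cases fall into two clusters of two.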

Long Cluster Problem

Vertex relationships become binary after component clustering (i.e., two cases are either clustered together or not). Unintuitive situations can arise in which a case shares a cluster with very distant cases through a chain of sequential linkages, producing misleading graphs.
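A small made-up example shows how a chain of close neighbours can pull distant cases into one cluster. The case names, the one-dimensional positions, and the cutoff value are assumptions for illustration only; real TN93 distances do not lie on a line.

```python
cutoff = 0.015

# Hypothetical cases placed on a line; distance is the gap between positions.
positions = {"A": 0.00, "B": 0.01, "C": 0.02, "D": 0.03}

def dist(u, v):
    return abs(positions[u] - positions[v])

# Flood fill from A, following only edges below the cutoff.
cluster, frontier = {"A"}, ["A"]
while frontier:
    u = frontier.pop()
    for v in positions:
        if v not in cluster and dist(u, v) < cutoff:
            cluster.add(v)
            frontier.append(v)

# Largest distance from A to any case that ended up in its cluster.
farthest = max(dist("A", v) for v in cluster)
print(sorted(cluster), farthest)
```

Every consecutive pair is within the cutoff, so all four cases share a cluster even though A and D are twice the cutoff apart.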

Path-Length based clustering

See igraph documentation https://igraph.org/r/doc/cluster_walktrap.html

A common theoretical solution to cluster assignment from weighted graphs, based on random walks. Cluster assignments are stochastic, as evaluating all possible walk paths in larger clusters becomes computationally restrictive. Optimizing the TN93 cutoff distance may be similar to optimizing the steps parameter of cluster_walktrap().


Growth Measurements

Embedded Case Growth

See the growG() function in tn93Analysis.R.

Clusters are formed based on cases from year Y and earlier. Case growth is counted as the number of cases from year Y in each cluster. This approach is compatible with clmp and tree-based clustering methods.
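The counting step can be sketched as follows. This is not the growG() code: the function name embedded_growth, the cluster and year dictionaries, and the example data are hypothetical.

```python
def embedded_growth(clusters, case_year, year):
    """For each cluster, count the members first observed in `year`.

    `clusters` maps a cluster id to its member cases and is assumed to have
    been built from all cases up to and including `year`.
    """
    return {
        cid: sum(1 for case in members if case_year[case] == year)
        for cid, members in clusters.items()
    }

# Made-up example with Y = 2013.
case_year = {"a": 2011, "b": 2013, "c": 2013, "d": 2010, "e": 2012}
clusters = {0: ["a", "b", "c"], 1: ["d", "e"]}
growth = embedded_growth(clusters, case_year, 2013)
print(growth)
```

In this example cluster 0 contains two cases from 2013 and cluster 1 contains none.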

Foresight Problem

Clusters from year Y will differ from those of year Y+1 because of new cluster formation and cluster merging. Any growth estimates obtained for year Y would therefore not be applicable to year Y+1 under this method, as the growth measurements for year Y+1 would be made on a different set of clusters.

Simulated Case Growth

See the growthSim() function in tn93Analysis.R.

Clusters are formed based on cases from year Y-1 and earlier. Case growth is counted by adding cases from year Y individually to those clusters. These are the clusters "under observation", and we would like to see how this exact set of clusters grows.
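One way to picture the procedure is the sketch below. It is not the growthSim() code: the function name simulated_growth, the one-dimensional positions standing in for pairwise distances, and the decision to set aside cases that link to several clusters are all assumptions made for illustration.

```python
def simulated_growth(clusters, new_cases, dist, cutoff):
    """Add each new case on its own to the fixed set of old clusters.

    A new case that links to exactly one cluster adds one unit of growth to
    it. A case that links to several clusters is returned separately, since
    it is ambiguous which cluster it should count towards.
    """
    growth = {cid: 0 for cid in clusters}
    bridging = {}
    for case in new_cases:
        linked = [
            cid for cid, members in clusters.items()
            if any(dist(case, m) < cutoff for m in members)
        ]
        if len(linked) == 1:
            growth[linked[0]] += 1
        elif len(linked) > 1:
            bridging[case] = linked
    return growth, bridging

# Hypothetical positions on a line; distance is the gap between positions.
positions = {"a": 0.000, "b": 0.012, "c": 0.036, "d": 0.050,  # year Y-1 and earlier
             "x": 0.005, "z": 0.024, "w": 0.100}              # year Y

def dist(u, v):
    return abs(positions[u] - positions[v])

old_clusters = {0: ["a", "b"], 1: ["c", "d"]}
growth, bridging = simulated_growth(old_clusters, ["x", "z", "w"], dist, cutoff=0.015)
print(growth, bridging)
```

Here x joins cluster 0, w joins nothing, and z is within the cutoff of both clusters, so it is reported as a bridging case rather than counted.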

Hindsight Problem

Opposite to the foresight problem experienced by embedded case growth, clusters from year Y-1 will be different from those formed at year Y-2. This creates another unrealistic situation, where the clusters from the reference

Merge Problems

As we simulate adding new cases, a new case may bridge multiple clusters together. This creates an indexing problem, as what were once considered two separate clusters may now be considered one.

Weighted Merge Solution

Default solution used by growthSim()

Closest Merge Solution

Alternative option used by growthSim()
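The page does not spell out either rule, so the sketch below is only one plausible reading and should not be taken as what growthSim() does. It assumes that a weighted merge splits a bridging case's single unit of growth across the linked clusters in proportion to the number of edges into each, and that a closest merge gives the whole unit to the cluster holding the nearest old case. The function names and data are hypothetical.

```python
def weighted_merge(case, linked, clusters, dist, cutoff):
    """Assumed rule: split one unit of growth in proportion to edge counts."""
    edges = {
        cid: sum(1 for m in clusters[cid] if dist(case, m) < cutoff)
        for cid in linked
    }
    total = sum(edges.values())
    return {cid: n / total for cid, n in edges.items()}

def closest_merge(case, linked, clusters, dist):
    """Assumed rule: the cluster containing the nearest old case takes all."""
    return min(linked, key=lambda cid: min(dist(case, m) for m in clusters[cid]))

# Hypothetical positions on a line; distance is the gap between positions.
positions = {"a": 0.000, "b": 0.012, "c": 0.034, "d": 0.038, "z": 0.025}

def dist(u, v):
    return abs(positions[u] - positions[v])

clusters = {0: ["a", "b"], 1: ["c", "d"]}
# The new case z is within the cutoff of b (cluster 0) and of c and d (cluster 1).
shares = weighted_merge("z", [0, 1], clusters, dist, cutoff=0.015)
winner = closest_merge("z", [0, 1], clusters, dist)
print(shares, winner)
```

Under these assumptions z has one edge into cluster 0 and two into cluster 1, so the weighted rule gives cluster 1 two thirds of the growth, while the closest rule gives cluster 1 all of it.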


Forecasting

The growth at year Y should be somewhat predictable by the information within years up to and including Y-1. If we view growth at Y as an outcome variable and measure it using one of the methods described above, we still need a predictor variable.

Relative Recent Cluster Growth Model

See 2018 NY Study, Wertheim et al

We can use the embedded growth measurement to establish growth from year Y-6 to Y-1, counting all cases from those 5 years and dividing by 5. To measure relative rather than absolute growth, we may also divide by the square root of cluster size at year Y-1. This avoids the hindsight and foresight problems mentioned above.
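The arithmetic can be sketched as below. The function name recent_growth and the example data are hypothetical, and the sketch assumes the five counted years are the calendar years Y-5 through Y-1.

```python
import math

def recent_growth(clusters, case_year, year, window=5, relative=False):
    """Mean yearly growth over the `window` years ending at year - 1.

    Assumes clusters were built from cases up to year - 1, so every cluster
    has at least one member. With relative=True the rate is divided by the
    square root of cluster size at year - 1.
    """
    rates = {}
    for cid, members in clusters.items():
        recent = sum(
            1 for c in members if year - window <= case_year[c] <= year - 1
        )
        rate = recent / window
        if relative:
            size = sum(1 for c in members if case_year[c] <= year - 1)
            rate /= math.sqrt(size)
        rates[cid] = rate
    return rates

# Made-up cluster of four cases, evaluated for Y = 2015.
case_year = {"a": 2005, "b": 2011, "c": 2013, "d": 2014}
clusters = {0: ["a", "b", "c", "d"]}
absolute = recent_growth(clusters, case_year, 2015)
relative = recent_growth(clusters, case_year, 2015, relative=True)
print(absolute, relative)
```

Three of the four cases fall in 2010 to 2014, giving an absolute rate of 3/5 cases per year; dividing by the square root of the cluster size (2) halves that for the relative rate.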

Point of View Contradiction Problem

The second figure in simulated growth measurement demonstrates the way this method only counts direct new case linkages. Compared to embedded growth measurement, this is likely to give much lower measurements of cluster growth. This means the predicted growth will be skewed to overestimate the growth at Y.

Growth by new Clusters solutions

If we see new cases in clusters and treat those clusters as objects with a weight defined by size and a valency defined by edges leading from a new case cluster to an old case, then we can apply a method similar to the weighted merge solution.

We can also apply a method similar to the closest merge solution, which will lead to more extreme variation in individual node growth.

Age-Dependent Frequency Model

The initial frequency function f(x) can be thought of as a Poisson-Linked GLM where the age of a given case predicts the likelihood that it will be connected to newer cases.
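Written out, one form such a model could take is shown below. The log link, the single linear term, and the symbols k_i (number of connections from case i to newer cases), x_i (age of case i), and the coefficients are assumptions; the page only states that f(x) is a Poisson GLM with case age as the predictor.

```latex
k_i \sim \mathrm{Poisson}\bigl(f(x_i)\bigr), \qquad \log f(x_i) = \beta_0 + \beta_1 x_i
```

Under this form, a negative \beta_1 would mean that older cases are expected to attract fewer connections from newer cases.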


Full Poisson Dummy Model

See Nakaya, T (2000)

To obtain statistics such as GAIC and VPC, we need to compare clusters at a given cutoff to clusters at a cutoff of 0 (i.e., every individual case is a cluster of size 1). The full model should represent maximum variation in the data and act as a frame of comparison for the variation at a given level of aggregation.

Growth-Threshold Dependency Problem

Unlike in Nakaya's solution to the MAUP, our total network growth is affected by the cutoff threshold. With a cutoff of 0, no new cases are added, and therefore overall case growth will be 0.

Selective Disaggregation Solution

See the full option for the growth and forecast functions in tn93Analysis.R.

To keep total network growth static, we can choose to disaggregate only old cases (i.e., those coming before Y). We may still need to resolve merges (see the merge solutions under Simulated Case Growth).

Singular Past Growth Problem

The full model has difficulty producing meaningful growth estimates using the Relative Recent Cluster Growth model, because past growth only offers binary outcomes for single cases (either they appeared in the last 5 years or they did not). The selectively disaggregated full model is, however, compatible with the Age-Dependent Frequency model for forecasting growth.


Statistical Measurements

GAIC [See Nakaya, T (2000)]

VPC [See Austin, et al (2017)]


Current Method Overview