Use direct mixture sampling in simulation #21

nanxstats · 2024-12-08T08:00:14Z

Closes #20

This PR optimizes the sampling step in the synthetic data generation function, making it at least 10x faster, so that simulating X at 100k x 100k scale becomes easy.

Old implementation

In the previous (naive) implementation, we generated each document with a two-step hierarchical procedure.

Sample topic counts. Let $D_i$ be the length of document $i$ and $L[i, :]$ its document-topic distribution.

$$ \text{(topic assignments)} \sim \text{Multinomial}(D_i, L[i, :]) $$

Sample terms per topic. Let $t_j$ be the count for topic $j$ and $F[j, :]$ its topic-term distribution. For each topic $j$:

$$ \text{(terms from topic j)} \sim \text{Multinomial}(t_j, F[j, :]). $$

The per-topic term counts are then summed to get the document-term counts. This follows the data model exactly but requires a double for-loop (over documents and over topics), which is slow.
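For reference, the two-step procedure described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code from this repository: the function name `simulate_two_step`, its arguments, and the use of `numpy.random.Generator` are all assumptions made for the example.

```python
import numpy as np


def simulate_two_step(L, F, doc_lengths, seed=0):
    """Hypothetical sketch of the hierarchical (double for-loop) sampler.

    L: (n_docs, n_topics) document-topic distributions, rows sum to 1.
    F: (n_topics, n_terms) topic-term distributions, rows sum to 1.
    doc_lengths: (n_docs,) integer document lengths D_i.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_topics = L.shape
    n_terms = F.shape[1]
    X = np.zeros((n_docs, n_terms), dtype=np.int64)
    for i in range(n_docs):
        # Step 1: split the D_i tokens of document i across topics.
        topic_counts = rng.multinomial(doc_lengths[i], L[i])
        for j in range(n_topics):
            # Step 2: distribute the t_j tokens of topic j over terms,
            # then accumulate into the document-term counts.
            X[i] += rng.multinomial(topic_counts[j], F[j])
    return X


# Small illustrative inputs (sizes chosen arbitrarily for the example).
rng = np.random.default_rng(1)
L = rng.dirichlet(np.ones(3), size=5)   # 5 documents, 3 topics
F = rng.dirichlet(np.ones(8), size=3)   # 3 topics, 8 terms
doc_lengths = np.full(5, 100)
X = simulate_two_step(L, F, doc_lengths)
```

The number of multinomial draws grows with the number of documents times the number of topics, which is where the cost comes from at large scale.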

New implementation

In the new implementation, we combine the two steps into one by sampling directly from the mixture:

$$ \text{(all terms)} \sim \text{Multinomial}(D_i, \sum_j L[i,j] \cdot F[j,:]). $$

This relies on a property of the multinomial distribution: if $t \sim \text{Multinomial}(n, p)$ and, given $t$, the vectors $Y_j \sim \text{Multinomial}(t_j, F[j, :])$ are drawn independently for each $j$, then $\sum_j Y_j \sim \text{Multinomial}(n, \sum_j p_j \cdot F[j, :])$. In other words, first choosing a category from one multinomial and then choosing an outcome within that category from another is equivalent to a single multinomial draw from the mixture distribution.
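A short way to see why this holds is to look at a single token of document $i$. The notation $z$ (the token's topic) and $w$ (the token's term) is introduced here only for illustration. By the law of total probability:

$$ P(w = v) = \sum_j P(z = j) \cdot P(w = v \mid z = j) = \sum_j L[i,j] \cdot F[j,v]. $$

The $D_i$ tokens of a document are drawn independently from this same distribution, and counting independent categorical draws gives a multinomial, so the term counts follow $\text{Multinomial}(D_i, \sum_j L[i,j] \cdot F[j,:])$, matching the one-step formula.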

Sampling therefore reduces to one matrix multiplication followed by one multinomial draw per document, which is much faster. The most time-consuming dataset in #20 now takes less than 5 seconds to generate.
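The direct mixture approach can be sketched along these lines. Again, this is a hedged NumPy illustration rather than the repository's actual code: `simulate_mixture` and its arguments are hypothetical names, and the row renormalization is an added safeguard against floating-point drift that the PR does not mention.

```python
import numpy as np


def simulate_mixture(L, F, doc_lengths, seed=0):
    """Hypothetical sketch of direct sampling from the mixture L @ F.

    L: (n_docs, n_topics) document-topic distributions, rows sum to 1.
    F: (n_topics, n_terms) topic-term distributions, rows sum to 1.
    doc_lengths: (n_docs,) integer document lengths D_i.
    """
    rng = np.random.default_rng(seed)
    # Row i of P is the mixture distribution sum_j L[i, j] * F[j, :].
    P = L @ F
    # Assumption for this sketch: renormalize rows so that rounding
    # error cannot push a row sum slightly above 1.
    P = P / P.sum(axis=1, keepdims=True)
    # One multinomial draw per document; no inner loop over topics.
    return np.stack(
        [rng.multinomial(n, p) for n, p in zip(doc_lengths, P)]
    )


# Small illustrative inputs (sizes chosen arbitrarily for the example).
rng = np.random.default_rng(1)
L = rng.dirichlet(np.ones(3), size=5)   # 5 documents, 3 topics
F = rng.dirichlet(np.ones(8), size=3)   # 3 topics, 8 terms
doc_lengths = np.full(5, 100)

X1 = simulate_mixture(L, F, doc_lengths, seed=42)
X2 = simulate_mixture(L, F, doc_lengths, seed=42)
```

With a fixed seed the two calls return identical matrices, which is the kind of property a reproducibility test can check.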

nanxstats added 2 commits December 8, 2024 02:24

Use direct mixture sampling in simulation

b7ecda4

Add test for reproducible simulations

b2e9adc

nanxstats merged commit a425b1d into main Dec 8, 2024
4 checks passed

nanxstats deleted the sim branch December 8, 2024 23:13
