Use direct mixture sampling in simulation #21

Merged
merged 2 commits into from
Dec 8, 2024

@nanxstats nanxstats commented Dec 8, 2024

Closes #20

This PR optimizes the sampling in the synthetic data generation function, making it at least 10x faster, so that simulating X at 100k x 100k scale becomes practical.

Old implementation

In the previous (naive) implementation, each document was generated by a two-step hierarchical procedure.

  1. Sample topic counts. Let $D_i$ be the length of document $i$ and $L[i, :]$ its document-topic distribution.

$$ \text{(topic counts)} \sim \text{Multinomial}(D_i, L[i, :]) $$

  2. Sample terms per topic. Let $t_j$ be the count for topic $j$ from step 1 and $F[j, :]$ the topic-term distribution. For each topic $j$:

$$ \text{(terms from topic j)} \sim \text{Multinomial}(t_j, F[j, :]). $$

The per-topic term counts are then summed to get the document-term counts. This follows the data model exactly but requires a double for-loop (over documents and over topics), which is slow.
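A minimal NumPy sketch of the two-step procedure described above, for illustration only. The function name `simulate_two_step` and the toy sizes are hypothetical and not taken from the package; only the names `L`, `F`, and the document lengths follow the notation in this PR.

```python
import numpy as np


def simulate_two_step(L, F, doc_lengths, rng):
    """Hierarchical sampling: topic counts first, then terms per topic.

    L: (n, k) document-topic matrix, rows sum to 1.
    F: (k, m) topic-term matrix, rows sum to 1.
    doc_lengths: (n,) integer document lengths D_i.
    """
    n, k = L.shape
    m = F.shape[1]
    X = np.zeros((n, m), dtype=np.int64)
    for i in range(n):
        # Step 1: how many tokens of document i come from each topic.
        topic_counts = rng.multinomial(doc_lengths[i], L[i])
        for j in range(k):
            # Step 2: distribute topic j's tokens over the vocabulary.
            if topic_counts[j] > 0:
                X[i] += rng.multinomial(topic_counts[j], F[j])
    return X


# Toy example (hypothetical sizes): 5 documents, 3 topics, 8 terms.
rng = np.random.default_rng(0)
L = rng.dirichlet(np.ones(3), size=5)
F = rng.dirichlet(np.ones(8), size=3)
doc_lengths = np.full(5, 100)
X_old = simulate_two_step(L, F, doc_lengths, rng)
```

Each row of `X_old` sums to the corresponding document length, since every token is assigned to exactly one topic and then to exactly one term.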

New implementation

In the new implementation, the two steps are combined into one by sampling directly from the mixture:

$$ \text{(all terms)} \sim \text{Multinomial}(D_i, \sum_j L[i,j] \cdot F[j,:]). $$

This relies on a property of the multinomial distribution: if $X = (X_1, \dots, X_K) \sim \text{Multinomial}(n, p)$ and, given $X$, $Y_j \sim \text{Multinomial}(X_j, q_j)$ independently for each category $j$, then $\sum_j Y_j \sim \text{Multinomial}(n, \sum_j p_j q_j)$. In other words, first choosing a category from one multinomial and then choosing an outcome within that category from another is equivalent to a single multinomial draw from the mixture distribution.
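A short per-token argument for the equivalence, assuming the tokens within a document are drawn independently. The symbols $z$ (topic of a token), $w$ (term of a token), and $v$ (a vocabulary index) are introduced here for illustration. For a single token in document $i$:

$$ P(w = v) = \sum_j P(z = j) \, P(w = v \mid z = j) = \sum_j L[i,j] \cdot F[j,v]. $$

Since the $D_i$ tokens are independent and identically distributed with these term probabilities, their term counts follow $\text{Multinomial}(D_i, \sum_j L[i,j] \cdot F[j,:])$.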

This reduces the work to one matrix multiplication ($L F$) followed by a single multinomial draw per document, which is much faster. The most time-consuming dataset in #20 now takes < 5s to generate.
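A minimal NumPy sketch of direct mixture sampling, under the same assumptions as the description above. The function name `simulate_mixture` and the toy sizes are hypothetical; the actual implementation in this PR may differ in how it batches or vectorizes the draws.

```python
import numpy as np


def simulate_mixture(L, F, doc_lengths, rng):
    """Draw document-term counts directly from the mixture L @ F.

    L: (n, k) document-topic matrix, rows sum to 1.
    F: (k, m) topic-term matrix, rows sum to 1.
    doc_lengths: (n,) integer document lengths D_i.
    """
    # Row i of P is the term distribution of document i: sum_j L[i, j] * F[j, :].
    P = L @ F
    # Guard against floating-point drift so each row sums to 1.
    P = P / P.sum(axis=1, keepdims=True)
    # One multinomial draw per document; no inner loop over topics.
    return np.stack([rng.multinomial(d, p) for d, p in zip(doc_lengths, P)])


# Toy example (hypothetical sizes): 5 documents, 3 topics, 8 terms.
rng = np.random.default_rng(0)
L = rng.dirichlet(np.ones(3), size=5)
F = rng.dirichlet(np.ones(8), size=3)
doc_lengths = np.full(5, 100)
X_new = simulate_mixture(L, F, doc_lengths, rng)
```

The output has the same distribution as the two-step procedure but not the same values for a given seed, because the random number stream is consumed differently.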

@nanxstats nanxstats merged commit a425b1d into main Dec 8, 2024
4 checks passed
@nanxstats nanxstats deleted the sim branch December 8, 2024 23:13
Successfully merging this pull request may close these issues.

Optimize synthetic data generation speed