<!-- `.md` and `.py` files are generated from the `.qmd` file. Please edit that file. -->

---
title: "Get started"
format: gfm
eval: false
---

!!! tip

    To run the code from this article as a Python script:

    ```bash
    python3 examples/get-started.py
    ```

## Import tinytopics

```{python}
from tinytopics.fit import fit_model
from tinytopics.plot import plot_loss, plot_structure, plot_top_terms
from tinytopics.utils import (
    set_random_seed,
    generate_synthetic_data,
    align_topics,
    sort_documents,
)
```

## Generate synthetic data

Set the random seed for reproducibility, then generate a synthetic dataset with n = 5000 documents, a vocabulary of m = 1000 terms, and k = 10 topics.

```{python}
set_random_seed(42)
n, m, k = 5000, 1000, 10
X, true_L, true_F = generate_synthetic_data(n, m, k, avg_doc_length=256 * 256)
```
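
As a quick sanity check, you can inspect the shapes of the generated objects. This is an optional sketch; it only assumes the returned values expose a `.shape` attribute, as both NumPy arrays and PyTorch tensors do.

```python
# Optional sanity check (sketch): X is the n-by-m document-term matrix,
# true_L the n-by-k document-topic matrix, and true_F the k-by-m
# topic-term matrix.
print(X.shape)       # (5000, 1000)
print(true_L.shape)  # (5000, 10)
print(true_F.shape)  # (10, 1000)
```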

## Training

Train the model

```{python}
model, losses = fit_model(X, k, learning_rate=0.01)
```

Plot the loss curve to check convergence

```{python}
plot_loss(losses, output_file="loss.png")
```

![](images/loss.png)

!!! tip

    The performance of the model can be sensitive to the learning rate.
    If you experience suboptimal results or observe performance discrepancies
    between the model trained on CPU and GPU, tuning the learning rate can help.

    For example, using the default learning rate of 0.001 on this synthetic
    dataset can lead to inconsistent results between devices (a worse model
    on CPU than on GPU). Increasing the learning rate towards 0.01 significantly
    improves model fit and ensures consistent performance across both devices.
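
If you want to compare a few candidate learning rates before committing to one, a small sweep can help. The sketch below reuses `fit_model` exactly as called above and assumes `losses` is a sequence of per-epoch loss values, as the loss plot suggests.

```python
# Illustrative learning-rate sweep (a sketch, not part of the article's
# pipeline): refit at a few rates and compare the final training loss.
for lr in (0.001, 0.005, 0.01):
    _, losses_lr = fit_model(X, k, learning_rate=lr)
    print(f"learning_rate={lr}: final loss = {losses_lr[-1]:.4f}")
```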

## Post-process results

Derive the learned document-topic matrix and topic-term matrix

```{python}
learned_L = model.get_normalized_L().numpy()
learned_F = model.get_normalized_F().numpy()
```
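
The accessor names suggest each row of the learned matrices is a probability distribution. The sketch below checks that expectation; note this is an assumption about the normalization, not documented behavior.

```python
import numpy as np

# Sketch: verify the assumed row-normalization of the learned matrices,
# i.e. each row sums to 1 up to floating-point error.
assert np.allclose(learned_L.sum(axis=1), 1.0, atol=1e-4)
assert np.allclose(learned_F.sum(axis=1), 1.0, atol=1e-4)
```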

Align the learned topics with the true topics. Topic order is not identifiable, so the learned topics can come back in any permutation relative to the truth

```{python}
aligned_indices = align_topics(true_F, learned_F)
learned_F_aligned = learned_F[aligned_indices]
learned_L_aligned = learned_L[:, aligned_indices]
```
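
For intuition, alignment can be framed as an assignment problem: score how similar each learned topic is to each true topic, then pick the one-to-one matching that maximizes total similarity. The sketch below illustrates this idea with dot-product similarity and SciPy's Hungarian-algorithm solver; `align_topics` may use a different similarity measure internally.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the alignment idea: similarity[i, j] scores true topic i
# against learned topic j; the solver finds the one-to-one matching
# that maximizes total similarity (hence the negated cost matrix).
similarity = np.asarray(true_F) @ np.asarray(learned_F).T
_, matched = linear_sum_assignment(-similarity)
```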

Sort the documents by their topic composition so the STRUCTURE plots below show coherent blocks

```{python}
sorted_indices = sort_documents(true_L)
true_L_sorted = true_L[sorted_indices]
learned_L_sorted = learned_L_aligned[sorted_indices]
```
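
One plausible sorting scheme (a sketch; `sort_documents` may use a different rule) groups documents by their dominant topic, ordered within each group by that topic's weight.

```python
import numpy as np

# Sketch of a dominant-topic sort: primary key is the argmax topic,
# secondary key is that topic's weight (descending).
L_true = np.asarray(true_L)
dominant = L_true.argmax(axis=1)
weight = L_true[np.arange(L_true.shape[0]), dominant]
order = np.lexsort((-weight, dominant))
```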

## Visualize results

STRUCTURE plots of the true and learned document-topic distributions

```{python}
plot_structure(
    true_L_sorted,
    title="True Document-Topic Distributions (Sorted)",
    output_file="L-true.png",
)
```

![](images/L-true.png)

```{python}
plot_structure(
    learned_L_sorted,
    title="Learned Document-Topic Distributions (Sorted and Aligned)",
    output_file="L-learned.png",
)
```

![](images/L-learned.png)

Top terms plots for the true and learned topic-term matrices

```{python}
plot_top_terms(
    true_F,
    n_top_terms=15,
    title="Top Terms per Topic - True F Matrix",
    output_file="F-top-terms-true.png",
)
```

![](images/F-top-terms-true.png)

```{python}
plot_top_terms(
    learned_F_aligned,
    n_top_terms=15,
    title="Top Terms per Topic - Learned F Matrix (Aligned)",
    output_file="F-top-terms-learned.png",
)
```

![](images/F-top-terms-learned.png)