diff --git a/README.md b/README.md index 185f93f..899b27d 100644 --- a/README.md +++ b/README.md @@ -51,3 +51,4 @@ After tinytopics is installed, try examples from: - [Getting started guide with simulated count data](https://nanx.me/tinytopics/articles/get-started/) - [CPU vs. GPU speed benchmark](https://nanx.me/tinytopics/articles/benchmark/) - [Text data topic modeling example](https://nanx.me/tinytopics/articles/text/) +- [Memory-efficient training](https://nanx.me/tinytopics/articles/memory/) diff --git a/docs/articles/images/memory/usage-100k-100k.png b/docs/articles/images/memory/usage-100k-100k.png new file mode 100644 index 0000000..79745f2 Binary files /dev/null and b/docs/articles/images/memory/usage-100k-100k.png differ diff --git a/docs/articles/images/memory/usage-500k-100k.png b/docs/articles/images/memory/usage-500k-100k.png new file mode 100644 index 0000000..994eb9f Binary files /dev/null and b/docs/articles/images/memory/usage-500k-100k.png differ diff --git a/docs/articles/memory.md b/docs/articles/memory.md new file mode 100644 index 0000000..a9ff032 --- /dev/null +++ b/docs/articles/memory.md @@ -0,0 +1,155 @@ +# Memory-efficient training + + + + +!!! tip + + To run the code from this article as a Python script: + + ```bash + python3 examples/memory.py + ``` + +This article discusses solutions for training topic models on datasets +larger than the available GPU VRAM or system RAM. + +## Training data larger than VRAM but smaller than RAM + +This scenario is manageable. Let’s see an example. We simulate a 100k x +100k dataset, requiring 37GB of memory. In this test, the dataset is +larger than the 24GB VRAM but smaller than the 64GB system RAM. + +``` python +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-100k-100k.png) + +Each epoch takes around 6 seconds. The peak GPU VRAM usage is 23.5GB, +and the peak RAM usage is around 60GB. + +Although the full dataset requires ~40 GB of RAM, the training process +only moves one small batch (controlled by `batch_size` in `fit_model()`) +onto the GPU at a time. The model parameters and a batch of data fit +within the 24GB VRAM, allowing the training to proceed. + +## Stream training data from disk + +A more general solution in PyTorch is to use map-style and +iterable-style datasets to stream data from disk on-demand, without +loading the entire tensor into system memory. + +Starting from tinytopics 0.6.0, you can use the `NumpyDiskDataset` class +to load `.npy` datasets from disk as training data, supported by +`fit_model()`. Here is an example: + +``` python +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") +``` + +## Training data larger than RAM + +Let’s demonstrate using a dataset larger than RAM. 
We will sample the +rows of the previous 100k x 100k dataset to construct a 500k x 100k +dataset, and save it into a 186GB `.npy` file using NumPy memory-mapped +mode. + +``` python +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +# Generate initial data +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +# Save initial data +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +# Free memory +del X, true_L, true_F + +# Create larger dataset by sampling with replacement +n_large = 500_000 +large_path = "X_large.npy" + +# Create empty memory-mapped file +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +# Initialize empty memory-mapped numpy array +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +# Read and write in chunks to limit memory usage +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +# Flush changes to disk +large_array.flush() + +# Train using the large dataset +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-500k-100k.png) + +Each epoch now takes 5 to 6 minutes due to heavy data movement between +disk, RAM, and VRAM. CPU and RAM usage are both maxed out. The peak VRAM +usage is only 1.6GB, and the peak RAM usage is near 64GB. diff --git a/docs/articles/memory.qmd b/docs/articles/memory.qmd new file mode 100644 index 0000000..4d1bd07 --- /dev/null +++ b/docs/articles/memory.qmd @@ -0,0 +1,158 @@ + + +--- +title: "Memory-efficient training" +format: gfm +eval: false +--- + +!!! tip + + To run the code from this article as a Python script: + + ```bash + python3 examples/memory.py + ``` + +This article discusses solutions for training topic models on datasets +larger than the available GPU VRAM or system RAM. + +## Training data larger than VRAM but smaller than RAM + +This scenario is manageable. Let's see an example. +We simulate a 100k x 100k dataset, requiring 37GB of memory. +In this test, the dataset is larger than the 24GB VRAM but smaller than +the 64GB system RAM. + +```{python} +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-100k-100k.png) + +Each epoch takes around 6 seconds. The peak GPU VRAM usage is 23.5GB, +and the peak RAM usage is around 60GB. + +Although the full dataset requires ~40 GB of RAM, the training process only +moves one small batch (controlled by `batch_size` in `fit_model()`) onto the +GPU at a time. The model parameters and a batch of data fit within the 24GB +VRAM, allowing the training to proceed. 
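+
+If 24GB of VRAM is still not enough, lowering `batch_size` in `fit_model()`
+trades speed for a smaller GPU footprint. A minimal sketch reusing the `X` and
+`k` from the example above (the value 16 is purely illustrative; the library's
+default batch size may differ):
+
+```{python}
+# Same X and k as above; a smaller batch keeps fewer rows on the GPU at once,
+# lowering peak VRAM at the cost of more optimizer steps per epoch.
+model, losses = tt.fit_model(X, k=k, num_epochs=200, batch_size=16)
+```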
+ +## Stream training data from disk + +A more general solution in PyTorch is to use map-style and iterable-style +datasets to stream data from disk on-demand, without loading the entire +tensor into system memory. + +Starting from tinytopics 0.6.0, you can use the `NumpyDiskDataset` class to +load `.npy` datasets from disk as training data, supported by `fit_model()`. +Here is an example: + +```{python} +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") +``` + +## Training data larger than RAM + +Let's demonstrate using a dataset larger than RAM. We will sample the rows of +the previous 100k x 100k dataset to construct a 500k x 100k dataset, +and save it into a 186GB `.npy` file using NumPy memory-mapped mode. + +```{python} +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +# Generate initial data +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +# Save initial data +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +# Free memory +del X, true_L, true_F + +# Create larger dataset by sampling with replacement +n_large = 500_000 +large_path = "X_large.npy" + +# Create empty memory-mapped file +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +# Initialize empty memory-mapped numpy array +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +# Read and write in chunks to limit memory usage +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +# Flush changes to disk +large_array.flush() + +# Train using the large dataset +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-500k-100k.png) + +Each epoch now takes 5 to 6 minutes due to heavy data movement between disk, +RAM, and VRAM. CPU and RAM usage are both maxed out. The peak VRAM usage is +only 1.6GB, and the peak RAM usage is near 64GB. diff --git a/docs/index.md b/docs/index.md index ff553b5..94225d9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -51,3 +51,4 @@ After tinytopics is installed, try examples from: - [Getting started guide with simulated count data](https://nanx.me/tinytopics/articles/get-started/) - [CPU vs. 
GPU speed benchmark](https://nanx.me/tinytopics/articles/benchmark/) - [Text data topic modeling example](https://nanx.me/tinytopics/articles/text/) +- [Memory-efficient training](https://nanx.me/tinytopics/articles/memory/) diff --git a/docs/scripts/sync.sh b/docs/scripts/sync.sh index 024f98e..943c42f 100644 --- a/docs/scripts/sync.sh +++ b/docs/scripts/sync.sh @@ -29,7 +29,7 @@ sync_article() { } # Sync articles -for article in get-started benchmark text; do +for article in get-started benchmark text memory; do sync_article "$article" done diff --git a/examples/memory.py b/examples/memory.py new file mode 100644 index 0000000..4c29a41 --- /dev/null +++ b/examples/memory.py @@ -0,0 +1,85 @@ +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") + +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") + +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +del X, true_L, true_F + +n_large = 500_000 +large_path = "X_large.npy" + +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +large_array.flush() + +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") diff --git a/mkdocs.yml b/mkdocs.yml index 16ed01f..e33cd28 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,7 @@ nav: - articles/get-started.md - articles/benchmark.md - articles/text.md + - articles/memory.md - API Reference: - Fit: reference/fit.md - Models: reference/models.md