diff --git a/README.md b/README.md index 185f93f..899b27d 100644 --- a/README.md +++ b/README.md @@ -51,3 +51,4 @@ After tinytopics is installed, try examples from: - [Getting started guide with simulated count data](https://nanx.me/tinytopics/articles/get-started/) - [CPU vs. GPU speed benchmark](https://nanx.me/tinytopics/articles/benchmark/) - [Text data topic modeling example](https://nanx.me/tinytopics/articles/text/) +- [Memory-efficient training](https://nanx.me/tinytopics/articles/memory/) diff --git a/docs/articles/images/memory/usage-100k-100k.png b/docs/articles/images/memory/usage-100k-100k.png new file mode 100644 index 0000000..79745f2 Binary files /dev/null and b/docs/articles/images/memory/usage-100k-100k.png differ diff --git a/docs/articles/images/memory/usage-500k-100k.png b/docs/articles/images/memory/usage-500k-100k.png new file mode 100644 index 0000000..994eb9f Binary files /dev/null and b/docs/articles/images/memory/usage-500k-100k.png differ diff --git a/docs/articles/memory.md b/docs/articles/memory.md new file mode 100644 index 0000000..a9ff032 --- /dev/null +++ b/docs/articles/memory.md @@ -0,0 +1,155 @@ +# Memory-efficient training + + + + +!!! tip + + To run the code from this article as a Python script: + + ```bash + python3 examples/memory.py + ``` + +This article discusses solutions for training topic models on datasets +larger than the available GPU VRAM or system RAM. + +## Training data larger than VRAM but smaller than RAM + +This scenario is manageable. Let’s see an example. We simulate a 100k x +100k dataset, requiring 37GB of memory. In this test, the dataset is +larger than the 24GB VRAM but smaller than the 64GB system RAM. + +``` python +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-100k-100k.png) + +Each epoch takes around 6 seconds. The peak GPU VRAM usage is 23.5GB, +and the peak RAM usage is around 60GB. + +Although the full dataset requires ~40 GB of RAM, the training process +only moves one small batch (controlled by `batch_size` in `fit_model()`) +onto the GPU at a time. The model parameters and a batch of data fit +within the 24GB VRAM, allowing the training to proceed. + +## Stream training data from disk + +A more general solution in PyTorch is to use map-style and +iterable-style datasets to stream data from disk on-demand, without +loading the entire tensor into system memory. + +Starting from tinytopics 0.6.0, you can use the `NumpyDiskDataset` class +to load `.npy` datasets from disk as training data, supported by +`fit_model()`. Here is an example: + +``` python +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") +``` + +## Training data larger than RAM + +Let’s demonstrate using a dataset larger than RAM. 
We will sample the +rows of the previous 100k x 100k dataset to construct a 500k x 100k +dataset, and save it into a 186GB `.npy` file using NumPy memory-mapped +mode. + +``` python +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +# Generate initial data +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +# Save initial data +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +# Free memory +del X, true_L, true_F + +# Create larger dataset by sampling with replacement +n_large = 500_000 +large_path = "X_large.npy" + +# Create empty memory-mapped file +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +# Initialize empty memory-mapped numpy array +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +# Read and write in chunks to limit memory usage +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +# Flush changes to disk +large_array.flush() + +# Train using the large dataset +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-500k-100k.png) + +Each epoch now takes 5 to 6 minutes due to heavy data movement between +disk, RAM, and VRAM. CPU and RAM usage are both maxed out. The peak VRAM +usage is only 1.6GB, and the peak RAM usage is near 64GB. diff --git a/docs/articles/memory.qmd b/docs/articles/memory.qmd new file mode 100644 index 0000000..4d1bd07 --- /dev/null +++ b/docs/articles/memory.qmd @@ -0,0 +1,158 @@ + + +--- +title: "Memory-efficient training" +format: gfm +eval: false +--- + +!!! tip + + To run the code from this article as a Python script: + + ```bash + python3 examples/memory.py + ``` + +This article discusses solutions for training topic models on datasets +larger than the available GPU VRAM or system RAM. + +## Training data larger than VRAM but smaller than RAM + +This scenario is manageable. Let's see an example. +We simulate a 100k x 100k dataset, requiring 37GB of memory. +In this test, the dataset is larger than the 24GB VRAM but smaller than +the 64GB system RAM. + +```{python} +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-100k-100k.png) + +Each epoch takes around 6 seconds. The peak GPU VRAM usage is 23.5GB, +and the peak RAM usage is around 60GB. + +Although the full dataset requires ~40 GB of RAM, the training process only +moves one small batch (controlled by `batch_size` in `fit_model()`) onto the +GPU at a time. The model parameters and a batch of data fit within the 24GB +VRAM, allowing the training to proceed. 
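+
+If 24GB of VRAM is still not enough, lowering `batch_size` in `fit_model()`
+trades speed for a smaller GPU footprint. A minimal sketch reusing the `X` and
+`k` from the example above (the value 16 is purely illustrative; the library's
+default batch size may differ):
+
+```{python}
+# Same X and k as above; a smaller batch keeps fewer rows on the GPU at once,
+# lowering peak VRAM at the cost of more optimizer steps per epoch.
+model, losses = tt.fit_model(X, k=k, num_epochs=200, batch_size=16)
+```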
+ +## Stream training data from disk + +A more general solution in PyTorch is to use map-style and iterable-style +datasets to stream data from disk on-demand, without loading the entire +tensor into system memory. + +Starting from tinytopics 0.6.0, you can use the `NumpyDiskDataset` class to +load `.npy` datasets from disk as training data, supported by `fit_model()`. +Here is an example: + +```{python} +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") +``` + +## Training data larger than RAM + +Let's demonstrate using a dataset larger than RAM. We will sample the rows of +the previous 100k x 100k dataset to construct a 500k x 100k dataset, +and save it into a 186GB `.npy` file using NumPy memory-mapped mode. + +```{python} +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +# Generate initial data +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +# Save initial data +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +# Free memory +del X, true_L, true_F + +# Create larger dataset by sampling with replacement +n_large = 500_000 +large_path = "X_large.npy" + +# Create empty memory-mapped file +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +# Initialize empty memory-mapped numpy array +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +# Read and write in chunks to limit memory usage +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +# Flush changes to disk +large_array.flush() + +# Train using the large dataset +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") +``` + +![](images/memory/usage-500k-100k.png) + +Each epoch now takes 5 to 6 minutes due to heavy data movement between disk, +RAM, and VRAM. CPU and RAM usage are both maxed out. The peak VRAM usage is +only 1.6GB, and the peak RAM usage is near 64GB. diff --git a/docs/index.md b/docs/index.md index ff553b5..94225d9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -51,3 +51,4 @@ After tinytopics is installed, try examples from: - [Getting started guide with simulated count data](https://nanx.me/tinytopics/articles/get-started/) - [CPU vs. 
GPU speed benchmark](https://nanx.me/tinytopics/articles/benchmark/) - [Text data topic modeling example](https://nanx.me/tinytopics/articles/text/) +- [Memory-efficient training](https://nanx.me/tinytopics/articles/memory/) diff --git a/docs/scripts/sync.sh b/docs/scripts/sync.sh index 024f98e..943c42f 100644 --- a/docs/scripts/sync.sh +++ b/docs/scripts/sync.sh @@ -29,7 +29,7 @@ sync_article() { } # Sync articles -for article in get-started benchmark text; do +for article in get-started benchmark text memory; do sync_article "$article" done diff --git a/examples/memory.py b/examples/memory.py new file mode 100644 index 0000000..4c29a41 --- /dev/null +++ b/examples/memory.py @@ -0,0 +1,85 @@ +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +model, losses = tt.fit_model(X, k=k, num_epochs=200) + +tt.plot_loss(losses, output_file="loss.png") + +import numpy as np + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +data_path = "X.npy" +np.save(data_path, X.cpu().numpy()) + +del X, true_L, true_F + +dataset = tt.NumpyDiskDataset(data_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=100) + +tt.plot_loss(losses, output_file="loss.png") + +import numpy as np +from tqdm import tqdm + +import tinytopics as tt + +tt.set_random_seed(42) + +n, m, k = 100_000, 100_000, 20 +X, true_L, true_F = tt.generate_synthetic_data(n, m, k, avg_doc_length=256 * 256) + +init_path = "X.npy" +np.save(init_path, X.cpu().numpy()) + +size_gb = X.nbytes / (1024**3) +print(f"Memory size of X: {size_gb:.2f} GB") + +del X, true_L, true_F + +n_large = 500_000 +large_path = "X_large.npy" + +shape = (n_large, m) +large_size_gb = (shape[0] * shape[1] * 4) / (1024**3) # 4 bytes per float32 +print(f"Expected size: {large_size_gb:.2f} GB") + +large_array = np.lib.format.open_memmap( + large_path, + mode="w+", + dtype=np.float32, + shape=shape, +) + +chunk_size = 10_000 +n_chunks = n_large // chunk_size + +source_data = np.load(init_path, mmap_mode="r") + +for i in tqdm(range(n_chunks), desc="Generating chunks"): + start_idx = i * chunk_size + end_idx = start_idx + chunk_size + indices = np.random.randint(0, n, size=chunk_size) + large_array[start_idx:end_idx] = source_data[indices] + +large_array.flush() + +dataset = tt.NumpyDiskDataset(large_path) +model, losses = tt.fit_model(dataset, k=k, num_epochs=20) + +tt.plot_loss(losses, output_file="loss.png") diff --git a/mkdocs.yml b/mkdocs.yml index 16ed01f..e33cd28 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,7 @@ nav: - articles/get-started.md - articles/benchmark.md - articles/text.md + - articles/memory.md - API Reference: - Fit: reference/fit.md - Models: reference/models.md