
Provide Memory Benchmarking Feature to Benchmarking Code #14

Merged
merged 19 commits into dev on May 27, 2024

Conversation

@achew010 (Contributor) commented May 17, 2024

Description

This PR adds GPU memory logging to the benchmark script, as requested in #8, and updates the benchmark README with usage instructions.

There are two approaches to logging memory:

  • Using Nvidia's SMI CLI tool
  • Using Huggingface's HFTrainer's API

Note: Issue #19 has been created to address grouping the memory values under a common prefix; this will be handled in a future PR.

Usage

1. Nvidia's SMI CLI tool

Set environment variable MEMORY_LOGGING=nvidia to use run_benchmarks.sh with nvidia logging

For each experiment,

  • Before the experiment calls subprocess.run, an asynchronous nvidia-smi process is started to monitor only the GPU indices in $CUDA_VISIBLE_DEVICES and log to FILE_MEM inside Experiment.save_dir
  • After the experiment's subprocess call completes, the asynchronous process is terminated
  • At the end of each experiment, the log is read and aggregated to an average memory over time (MiB per second) for each device; the average across all devices is then saved as gpu_mem in the main result logging function Experiment.write_results (see the sketch below)
  • Since the measurement runs in an independent process, no slowdown in training speed is expected

Each experiment directory will have a GPU log containing <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>.

The memory readings are reflected in the results raw_summary.csv under the column 'nvidia_mem_reserved', with raw values reported in MiB.
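
A minimal sketch of this monitoring flow, assuming illustrative names (the helper names, log file name, and exact nvidia-smi flags here are not necessarily the ones used in benchmark.py):

```python
import subprocess
import pandas as pd

def start_gpu_monitor(save_dir: str, gpu_indices: str, interval_secs: int = 1):
    """Launch an asynchronous nvidia-smi process that appends one CSV row per
    selected device every `interval_secs` seconds until it is terminated."""
    log_file = f"{save_dir}/gpu_memory_log.csv"  # illustrative file name
    proc = subprocess.Popen([
        "nvidia-smi",
        "--query-gpu=timestamp,name,index,memory.used",
        "--format=csv",
        "-l", str(interval_secs),
        "-i", gpu_indices,   # e.g. the value of $CUDA_VISIBLE_DEVICES, "0,1"
        "-f", log_file,
    ])
    return proc, log_file

def stop_and_aggregate(proc, log_file: str) -> float:
    """Terminate the monitor, average memory.used over time per device,
    then return the mean across devices (MiB)."""
    proc.terminate()
    proc.wait()
    df = pd.read_csv(log_file, skipinitialspace=True)
    mem_col = "memory.used [MiB]"  # header emitted by nvidia-smi's csv format
    df[mem_col] = df[mem_col].str.replace(" MiB", "", regex=False).astype(float)
    return df.groupby("index")[mem_col].mean().mean()
```

In the benchmark, this per-device average is what ends up in gpu_mem and is reported under nvidia_mem_reserved.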

2. Torch CUDA through Huggingface's HFTrainer's API

Set environment variable MEMORY_LOGGING=huggingface to use run_benchmarks.sh with huggingface logging (default)

HFTrainer has a feature to log memory through the skip_memory_metrics=False training argument. Its documentation mentions that setting this argument to False will affect training speed. In our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using this argument.

A set of fine-grained GPU readings will appear as additional columns in the results raw_summary.csv, with raw values reported in bytes.

(screenshot of the additional HFTrainer memory columns in raw_summary.csv)
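
For reference, a minimal sketch of switching on the HFTrainer-based logging (model and train_dataset are assumed to come from the benchmark's own setup; the output directory and step count are illustrative):

```python
from transformers import Trainer, TrainingArguments

# skip_memory_metrics defaults to True; setting it to False makes HFTrainer
# record CPU/GPU memory deltas around init/train/evaluate/predict and attach
# them to the metrics returned by train().
training_args = TrainingArguments(
    output_dir="./benchmark_output",   # illustrative
    per_device_train_batch_size=4,
    max_steps=50,
    skip_memory_metrics=False,
)

# `model` and `train_dataset` are placeholders for the benchmark's own setup.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
metrics = trainer.train().metrics
print({k: v for k, v in metrics.items() if "mem" in k})
```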

3. Log Both

Set environment variable MEMORY_LOGGING=all to use run_benchmarks.sh with both logging methods

4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API

1. The Nvidia-SMI utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, since it does not involve probing the trainer. It uses the NVML library to fetch reserved memory for each device ID, which includes:

  • All running processes using that device
  • All GPU memory used inside the training script
  • Any other GPU memory cached by the training script

Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.

2. The HFTrainer API is a more precise tool that logs memory usage for a few specific operations inside HFTrainer.

It uses torch.cuda.memory_allocated to probe the trainer, taking snapshots of allocated memory and storing the difference between the before and after of each stage. The following stages are probed:

  • Before Trainer init
  • Trainer.__init__
  • Trainer.train
  • Trainer.evaluate
  • Trainer.predict

Note: Any GPU memory used outside these stages, or outside HFTrainer entirely, will not be tracked. If the training script does not use the Huggingface trainer, this API will not work either.
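
A small standalone sketch of the snapshot-and-delta idea (illustrative only; HFTrainer's internal bookkeeping is more involved and also tracks CPU memory):

```python
import torch

def probe_stage(stage_name, fn, metrics):
    """Run `fn` and record, for this stage, the change in allocated GPU memory
    and how far the high-water mark rose above the post-stage allocation."""
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    result = fn()
    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    metrics[f"{stage_name}_mem_gpu_alloc_delta"] = after - before
    metrics[f"{stage_name}_mem_gpu_peaked_delta"] = max(0, peak - after)
    return result

# usage sketch (Trainer construction omitted):
# metrics = {"before_init_mem_gpu": torch.cuda.memory_allocated()}
# trainer = probe_stage("init", lambda: Trainer(...), metrics)
# probe_stage("train", trainer.train, metrics)
```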

Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP

This is an example of the memory values that HFTrainer produces in the output of train():

```python
output_metrics = {
    'train_runtime': 191.2491,
    'train_samples_per_second': 0.209,
    'train_steps_per_second': 0.052,
    'train_tokens_per_second': 428.342,
    'train_loss': 1.0627506256103516,
    'init_mem_cpu_alloc_delta': 4096,
    'init_mem_gpu_alloc_delta': 0,
    'init_mem_cpu_peaked_delta': 0,
    'init_mem_gpu_peaked_delta': 0,
    'train_mem_cpu_alloc_delta': 839086080,
    'train_mem_gpu_alloc_delta': -17491768832,
    'train_mem_cpu_peaked_delta': 0,
    'train_mem_gpu_peaked_delta': 26747825664,
    'before_init_mem_cpu': 5513297920,
    'before_init_mem_gpu': 36141687296,
    'epoch': 0.01
}
```

We refer to the keys of the memory metrics in this order

  • before_init_mem_X as stage0
  • init_mem_X as stage1
  • train_mem_X as stage2
  • ...

We currently compute the memory values in the report by taking the largest of sums. For example:

For allocated memory value

```
max([
  stage0_mem + stage1_allocated_delta,
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta,
  ...
])
```

For peak memory value

```
max([
  stage0_mem + stage1_allocated_delta + stage1_peaked_delta,
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta + stage2_peaked_delta,
  ...
])
```

Notice that we do not include stage0_mem alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode. GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The stage0_mem value of GPTQ-LoRA + FSDP is larger because the model is fully loaded before the trainer is initialized and only sharded afterwards inside trainer.prepare.

This could lead to misleading comparisons when other variants are loaded in low-memory mode and show smaller stage0_mem consumption than GPTQ-LoRA + FSDP before sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include stage0_mem back in the max computation.
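
A short Python sketch of this largest-of-sums rule applied to the GPU keys of the metrics dict above (the helper name and default stage list are illustrative):

```python
def summarize_gpu_memory(metrics, stages=("init", "train")):
    """Return (allocated, peak) GPU memory in bytes using the largest-of-sums
    rule; stage0_mem on its own is deliberately excluded from the max."""
    running = metrics["before_init_mem_gpu"]   # stage0_mem
    alloc_candidates, peak_candidates = [], []
    for stage in stages:
        running += metrics.get(f"{stage}_mem_gpu_alloc_delta", 0)
        alloc_candidates.append(running)
        peak_candidates.append(running + metrics.get(f"{stage}_mem_gpu_peaked_delta", 0))
    return max(alloc_candidates), max(peak_candidates)

# With the example output_metrics above this evaluates to
# alloc = max(stage0 + init_delta, stage0 + init_delta + train_delta)
# peak  = max(stage0 + init_delta + init_peaked,
#             stage0 + init_delta + train_delta + train_peaked)
```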

Tests

Memory Measurement Accuracy and Potential Side Effects

1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training

For both the Mistral-7B and Mixtral models, introducing the memory probes does not significantly impact the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.

A. <100 toks/sec slowdown after introducing the memory probes for Mistral,

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 3465 | 3432 |
| mistralai/Mistral-7B-v0.1 | 2 | 2 | 2973 | 2931 |
| mistralai/Mistral-7B-v0.1 | 1 | 8 | 3489 | 3508 |
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 3383 | 3298 |

B. <100 toks/sec slowdown after introducing the memory probes for Mixtral

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1785 | 1776 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 2 | 1518 | 1442 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 8 | 1938 | 1933 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1757 | 1724 |

2. Torch/HF provides more granular memory readings (peak memory and actual allocated memory) than Nvidia's reserved memory. This is more helpful when analyzing the memory actually allocated for each model.

We compare the two memory tracking methods (Nvidia vs. Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's peak mem reserved reports larger values than Torch/HF's peak mem alloc, and torch mem alloc shows that the memory actually in use is smaller still.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 19.46 | 15.86 | 4.84 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 1 | 4 | 19.97 | 15.89 | 4.87 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 37.49 | 36.22 | 25.2 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 36.59 | 35.53 | 24.51 |
| NousResearch/Llama-2-70b-hf | 1 | 4 | 71.12 | 68.16 | 37.35 |
| TheBloke/Llama-2-70B-GPTQ | 1 | 4 | 70.51 | 65.9 | 36.29 |

3. Memory Usage Decreases on Distributed Finetuning

When running large models on multiple devices, torch mem alloc shows that memory usage decreases as the models are sharded (compare with the table above).

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 20.97 | 16.59 | 2.73 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 2 | 4 | 23.75 | 16.26 | 3.01 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 32.59 | 29.33 | 13.22 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 51.1 | 27.74 | 12.79 |

We verified that torch mem alloc for GPTQ-LoRA on Llama2-70B hovers around 19 GiB once sharded after trainer.prepare and during training. The values are similar to the manually probed values from #15.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | 2 | 2 | 51.49 | 46.52 | 19.17 |
| TheBloke/Llama-2-70B-GPTQ | 2 | 2 | 78.69 | 45.4 | 18.65 |

Benchmarks

Run tox -e run_benches to produce benchmarks. Full benchmark details can be referenced here

4. For small models, LoRA runs faster than the quantized PEFT methods, likely because it does not require an additional dequantization operation before the base-layer and LoRA matmuls. However, it also consumes significantly more memory than the quantized PEFT methods.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | lora | none | 1 | 4 | 29.03 | 26.11 | 15.12 | 3597 |
| mistralai/Mistral-7B-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 19.46 | 15.86 | 4.84 | 3428 |
| TheBloke/Mistral-7B-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 19.97 | 15.89 | 4.87 | 3254 |

5. We observe that on single-device finetuning of larger models (e.g. the 49B Mixtral), PEFT begins to run out of memory while the quantized PEFT methods maintain low memory consumption.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | none | none | 1 | 4 | 79.14 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | none | 1 | 4 | 79.06 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | baseline-peft-bnb | 1 | 4 | 47.18 | 46.42 | 25.73 | 1396 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 37.5 | 36.22 | 25.2 | 1764 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 36.58 | 35.53 | 24.51 | 1864 |

6. In distributed finetuning of large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption at comparable throughput.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 2 | 79 | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 2 | 51.4 | 46.52 | 19.17 | 418 |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 2 | 78.5 | 45.4 | 18.65 | 426 |

7. When the batch size is increased, GPTQ-LoRA is the only experiment that does not run out of memory.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 4 | OOM | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 4 | OOM | OOM | OOM | OOM |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 4 | 78.48 | 70.67 | 18.65 | 451 |

@achew010 achew010 changed the base branch from main to dev May 17, 2024 07:02
@achew010 achew010 marked this pull request as ready for review May 17, 2024 08:45
@achew010 achew010 requested a review from fabianlim as a code owner May 17, 2024 08:45
@fabianlim (Contributor) left a comment


It will be good to put a link to this issue somewhere in the code so that we can view the GPU log schema.

@fabianlim (Contributor) commented May 20, 2024

Can you run tox -e lint from the top-level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@achew010 (Contributor, Author) commented May 21, 2024

> Can you run tox -e lint from the top-level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@fabianlim okay, linted. Do we want to set nvidia-smi as the default memory logging approach, or wait until we establish that the speed degradation from using the HF memory logging API is insignificant and use that instead?

@fabianlim (Contributor) commented May 21, 2024

> okay, linted. Do we want to set nvidia-smi as the default memory logging approach, or wait until we establish that the speed degradation from using the HF memory logging API is insignificant and use that instead?

@achew010 let's merge that commit on top of this one and let me review.

@fabianlim fabianlim linked an issue May 23, 2024 that may be closed by this pull request
achew010 added a commit to achew010/fms-acceleration that referenced this pull request May 24, 2024
@fabianlim (Contributor) commented May 24, 2024

@achew010 I approved. After we update the csv we can merge. Also, can you run tox -e lint?

@fabianlim fabianlim merged commit f1895b7 into foundation-model-stack:dev May 27, 2024
2 checks passed
@fabianlim (Contributor) commented:

We have noted the memory keys should be renamed; to be addressed later in #19.

fabianlim added a commit that referenced this pull request May 27, 2024
…or GPTQ-LoRA (#20)

* Add GitHub Workflow for Linting , Formatting and Test. Activate Workflow for Framework (#7)

* add lint workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add pylintrc, update .tox fix files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* activate test and minor fix

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint benchmarks.py and add workflow to dev

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Improvements to Benchmark Scripts and Config Generation Workflow (#13)

* fix benches and add verify configs

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update readme and add workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add packaging dep

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update torch dep in framework and run-benches

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* take host env in run-benches

* add display bench results script

* rename summary.csv to raw_summary.csv and update run_benchmarks.sh

* export environment variables in shell command

* dump out pip requirements for repro, and add default FHT_branch

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Added support for running official HF baseline FSDP-QLoRA benchmark (#16)

* new baseline scenario

* rename variables

* added warning when plugin allows SFTTrainer to handle PEFT on single device

* Fix FSDP when performing GPTQ-LoRA with Triton V2  (#15)

* wrap in parameters and torch view to correct dtype

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor to apply patch only on FSDP and simplify

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Provide Memory Benchmarking Feature to Benchmarking Code (#14)

* add gpu memory logging support

* made improvements to GPU reference and result collation

* Renamed memory logging argument to reflect its readings as reserved memory using nvidia-smi and changed aggregation function in result collation

* variable renames

* manual linting

* added memory logging functionality via HFTrainer

* added support to benchmark memory using HFTrainer and updated README with explanation of the 2 memory benchmarking options

* addressed changes requested in PR #14

* fix bug and simplify gpu logs aggregation logic

* fixes to calculation of HFTrainer Mem Logging values

* fix calculations

* more fixes

* fix to avoid including stage0 alone in the max calculation of alloc memory

* more comments and README updates

* added fix to keyerror due to empty output dict from OOM

* manual linting

* added benchmark results to refs

* remove unnecessary columns in results gathering

* made changes to results gathering

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: achew010 <[email protected]>
@achew010 achew010 deleted the memory-benchmarks branch May 29, 2024 02:14
@fabianlim (Contributor) commented:
@achew010 can we move all the memory computation logic out of write_result into gather_report? That way results.json only holds the raw data, and gather_report holds all the logic to preprocess the data for human consumption.

@fabianlim (Contributor) commented Jun 1, 2024

@achew010 one more consideration: we should only have the huggingface mem probes in benchmark.csv, because command.sh cannot easily replay the nvidia-smi measurements. Actually, there may be more issues, because results.json is not even properly populated by command.sh.

Unless we have the tool do a proper replay and start nvidia-smi properly. Update: this is addressed in the commit below.

fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request Jun 1, 2024
fabianlim added a commit that referenced this pull request Jun 2, 2024
* refactor

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* fixes

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor mistral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add mixtral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* some refactoring after introducing mlp

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* remove extranous files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add bnb

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint + fmt and improvements to readme

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* bench fixes

* need to handle lora adapters device due to #26

* allow replay of failed benches, addressing comment in #14

* update benches (remove l40)

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Successfully merging this pull request may close these issues.

Add GPU measurements to Benchmark Script