Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc authored and pgmpablo157321 committed Feb 21, 2024
1 parent 5d0c221 commit dc94ae3
Showing 1 changed file with 25 additions and 20 deletions.
`recommendation/dlrm_v2/pytorch/README.md`
```
cd $HOME/mlcommons/inference/loadgen
CFLAGS="-std=c++14" python setup.py develop --user
```
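
As a quick sanity check, the loadgen Python bindings should now import cleanly (a minimal check, assuming the `develop --user` install is on your Python path):
```
python -c "import mlperf_loadgen; print('loadgen OK')"
```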


### Downloading model weights

framework | Size in bytes (`du *`) | MD5 hash (`md5sum *`)
-|-|-
pytorch | 97.31GB | -

#### CM method

The following MLCommons CM commands can be used to programmatically download the model checkpoint.

```
cm pull repo mlcommons@ck
cm run script --tags=get,ml-model,dlrm,_pytorch,_weight_sharded,_rclone -j
```
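
The `-j` flag prints the script's JSON output, which should include the local path of the downloaded checkpoint. If you want to inspect what CM has cached afterwards, something like the following should work (a sketch only; the `cm show cache` tags and output format may vary across cmind versions):
```
cm show cache --tags=ml-model,dlrm
```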

#### Manual method

The above command automatically runs a set of Rclone commands to download the data from a Cloudflare R2 bucket. However, if you'd like to run the Rclone commands manually, you can do so as follows:

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
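
You can verify the installation before proceeding:
```
rclone version
```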
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
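
To confirm the remote was configured correctly, you can list the bucket before starting the full download (a quick check, assuming the read-only credentials permit listing):
```
rclone lsd mlc-inference:mlcommons-inference-wg-public
```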
You can then navigate in the terminal to your desired download directory and run the following command to download the model weights:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/model_weights ./model_weights -P
```

To speed up future downloads, we recommend saving the weights in a cloud bucket (e.g. GCP, AWS). For example, after saving the checkpoint in a GCP bucket, you can download the weights faster by running:
```
export BUCKET_NAME=<BUCKET_CONTAINING_MODEL>
cd $HOME/mlcommons/inference/recommendation/dlrm_v2/pytorch/model/
gsutil -m cp -r "gs://$BUCKET_NAME/model_weights/*" .
```

### Downloading dataset
| Original dataset | download link |
| ---- | ---- |
| Criteo Terabyte (day 23) | https://labs.criteo.com/2013/12/download-terabyte-click-logs/ |

#### (optional)

1. A fake Criteo dataset can be created in place of the real dataset to facilitate debugging and testing. We provide a fake (random) data generator that quickly produces samples in a format compatible with the original dataset. Run the following script in `./tools` to create random samples for the corresponding model; they will be placed in the `./fake_criteo` directory:
```
./make_fake_criteo.sh
mv ./fake_criteo .. && cd ..
export DATA_DIR=./fake_criteo
```
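
As a quick check that the fake dataset was generated, list the target directory (the exact file names depend on the generator):
```
ls -lh "$DATA_DIR"
```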


