Unable to get `kaggle_criteo_weekly.txt` #1

laekov · 2024-02-18T03:10:20Z

Hi. I read your paper and found your ideas interesting. Thank you for open-sourcing your code.

However, when I try to run the Oracle Cacher, I cannot find any indication of how to get the `kaggle_criteo_weekly.txt` file that is required by `--processed-csv`. Could you please give me instructions for generating that file from the Criteo Kaggle or Terabyte dataset?

Also, I saw that your CSVLoader uses orjson to parse every line, so I am unsure whether the file is actually CSV or JSONL. (I have both CSV and npz versions of the Terabyte dataset, but neither of them seems to work for that argument.)

iidsample · 2024-02-19T02:01:20Z

Hey,
We use split JSONL files with pre-processing enabled.
Each line is one example from the dataset.
An example line is:
{"label":1.0,"dense":[2.5649492740631104,3.044522523880005,1.3862943649291992,1.3862943649291992,1.0986123085021973,1.3862943649291992,2.70805025100708,3.7841897010803223,3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}
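
A line in this format can be checked with a few lines of Python. The sketch below is only an illustration: it uses the standard `json` module (the loader in the repository uses orjson), and the helper name `parse_example` and the expected counts of 13 dense and 26 sparse features are assumptions read off the example line above, not taken from the repository.

```python
import json

NUM_DENSE = 13   # assumed: number of dense features in the example line
NUM_SPARSE = 26  # assumed: number of sparse features in the example line


def parse_example(line: str) -> tuple[float, list[float], list[int]]:
    """Parse one JSONL line into (label, dense, sparse). Hypothetical helper."""
    record = json.loads(line)
    label = float(record["label"])
    dense = [float(v) for v in record["dense"]]
    sparse = [int(v) for v in record["sparse"]]
    # Fail early if the line does not have the expected layout.
    if len(dense) != NUM_DENSE or len(sparse) != NUM_SPARSE:
        raise ValueError(
            f"expected {NUM_DENSE} dense and {NUM_SPARSE} sparse values, "
            f"got {len(dense)} and {len(sparse)}"
        )
    return label, dense, sparse


# The example line from this thread.
sample = (
    '{"label":1.0,"dense":[2.5649492740631104,3.044522523880005,'
    '1.3862943649291992,1.3862943649291992,1.0986123085021973,'
    '1.3862943649291992,2.70805025100708,3.7841897010803223,'
    '3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,'
    '1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,'
    '3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}'
)
label, dense, sparse = parse_example(sample)
```

Running this over the first few lines of a candidate file is a quick way to tell whether it is JSONL in this layout or a plain CSV.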

laekov · 2024-02-19T03:02:40Z

Should the sparse features be converted from the 32-bit hex IDs to contiguous indices (similar to the day_X_processed.npz files for TorchRec)?

iidsample · 2024-02-19T03:13:23Z

I have forgotten what TorchRec needs. We convert the hex IDs to integers, with each unique ID assigned a unique integer. I am happy to share pre-processed data if it helps you.
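
For readers without the shared data, the conversion described above might look roughly like the following sketch. It is not the authors' script. It assumes the raw Criteo layout of tab-separated lines with a label, 13 integer features, and 26 hex categorical features. It also assumes a separate ID dictionary per sparse column, that missing values are treated as their own category, and that dense values are transformed with log(x + 1) after clipping negatives to 0. The last assumption is a guess based on the example line, whose first dense value matches ln(13) at 32-bit float precision. The name `convert_lines` is hypothetical.

```python
import json
import math

NUM_DENSE = 13   # assumed Criteo layout
NUM_SPARSE = 26  # assumed Criteo layout


def convert_lines(raw_lines):
    """Yield JSONL strings from raw Criteo TSV lines. Hypothetical helper."""
    # One dictionary per sparse column: hex string -> contiguous integer.
    # Assumption: IDs are numbered per column in order of first appearance.
    id_maps = [{} for _ in range(NUM_SPARSE)]
    for raw in raw_lines:
        fields = raw.rstrip("\n").split("\t")
        label = float(fields[0])

        dense = []
        for value in fields[1:1 + NUM_DENSE]:
            # Assumption: missing counts become 0, negatives are clipped to 0.
            count = int(value) if value else 0
            dense.append(math.log1p(max(count, 0)))

        sparse = []
        start = 1 + NUM_DENSE
        for column, token in enumerate(fields[start:start + NUM_SPARSE]):
            mapping = id_maps[column]
            # Assign the next unused integer the first time a token is seen.
            sparse.append(mapping.setdefault(token, len(mapping)))

        yield json.dumps({"label": label, "dense": dense, "sparse": sparse})
```

Because the dictionaries are filled in order of first appearance, the integer assigned to a given hex ID depends on the order of the input, so the same pass would need to cover the whole dataset for the IDs to be consistent across split files.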

laekov · 2024-02-19T03:15:37Z

> I am happy to share pre-processed data if it helps you.

Sure. That would be great!

It would also help if you could share your script for creating the JSONL files from the npz files or the raw dataset.

iidsample · 2024-02-19T03:26:26Z

Okay, share your email and I can send you a link to download the data.

laekov · 2024-02-19T03:30:16Z

My email is '[email protected]'

Thanks

iidsample · 2024-02-19T03:38:54Z

I have shared the data. Replace the processed CSV file with the folder I have shared with you.

laekov · 2024-02-19T03:43:14Z

Got it. I will have a look. Thank you for your help.

iidsample self-assigned this Feb 19, 2024
