Unable to get kaggle_criteo_weekly.txt #1

Open
laekov opened this issue Feb 18, 2024 · 8 comments
@laekov

laekov commented Feb 18, 2024

Hi. I read your paper and find your ideas interesting. Thank you for opening your source code.

However, when I try to run the Oracle Cacher, I cannot find any instructions on how to obtain the kaggle_criteo_weekly.txt file that --processed-csv requires. Could you please explain how to generate that file from the Criteo Kaggle or Terabyte datasets?

Also, I saw that your CSVLoader uses orjson to parse every line, so I am unsure whether the file is actually CSV or JSONL. (I have both CSV and npz versions of the Terabyte dataset, but neither seems to work for that argument.)

@iidsample iidsample self-assigned this Feb 19, 2024
@iidsample
Collaborator

Hey,
We use split JSONL files with pre-processing already applied.
Each line is one example from the dataset.
An example line:
{"label":1.0,"dense":[2.5649492740631104,3.044522523880005,1.3862943649291992,1.3862943649291992,1.0986123085021973,1.3862943649291992,2.70805025100708,3.7841897010803223,3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}
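
For readers following along, here is a minimal sketch of reading one such line. It is not the repository's loader: the thread mentions that CSVLoader uses orjson, while this sketch uses the standard-library json module so it runs without extra dependencies (orjson.loads could be substituted for json.loads). The function name parse_example is hypothetical, and the expected field counts (13 dense, 26 sparse) are taken from the example line above.

```python
import json

# The example line quoted above, copied verbatim.
EXAMPLE_LINE = (
    '{"label":1.0,"dense":[2.5649492740631104,3.044522523880005,'
    '1.3862943649291992,1.3862943649291992,1.0986123085021973,'
    '1.3862943649291992,2.70805025100708,3.7841897010803223,'
    '3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,'
    '1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,'
    '3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}'
)

NUM_DENSE = 13   # count observed in the example line
NUM_SPARSE = 26  # count observed in the example line


def parse_example(line):
    """Parse one JSONL line into (label, dense, sparse).

    Hypothetical helper: validates the field counts seen in the example
    line and raises ValueError if a line does not match them.
    """
    record = json.loads(line)
    label = float(record["label"])
    dense = [float(x) for x in record["dense"]]
    sparse = [int(x) for x in record["sparse"]]
    if len(dense) != NUM_DENSE or len(sparse) != NUM_SPARSE:
        raise ValueError(
            f"expected {NUM_DENSE} dense and {NUM_SPARSE} sparse values, "
            f"got {len(dense)} and {len(sparse)}"
        )
    return label, dense, sparse


label, dense, sparse = parse_example(EXAMPLE_LINE)
print(label, len(dense), len(sparse))
```

A quick check like this on the first few lines of a file is a cheap way to confirm that it is JSONL in this shape rather than CSV or npz.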

@laekov
Author

laekov commented Feb 19, 2024

Should the sparse features be converted from the 32-bit hex IDs to contiguous indices? (similar to the day_X_processed.npz for TorchRec)

@iidsample
Collaborator

iidsample commented Feb 19, 2024

I have forgotten what TorchRec needs. We convert the hex IDs to integers, with each unique ID assigned a unique integer. I am happy to share the pre-processed data if it helps you.
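
As an illustration of the mapping described here, the following is a rough sketch of turning one raw Criteo line (tab-separated: label, 13 integer features, 26 hex categorical features) into a JSONL line. It is not the maintainers' script, and several choices are assumptions: the class name CriteoToJsonl is hypothetical, IDs are assigned per column in first-seen order, missing or negative dense values are treated as 0, an empty categorical field gets its own ID, and dense values are transformed with log(x + 1), which is only inferred from the example line (its dense values look like logarithms of small integers).

```python
import json
import math

NUM_DENSE = 13
NUM_SPARSE = 26


class CriteoToJsonl:
    """Hypothetical converter from raw Criteo TSV lines to JSONL lines."""

    def __init__(self):
        # Assumption: one vocabulary per categorical column.
        self.vocabs = [dict() for _ in range(NUM_SPARSE)]

    def convert(self, raw_line):
        fields = raw_line.rstrip("\n").split("\t")
        if len(fields) != 1 + NUM_DENSE + NUM_SPARSE:
            raise ValueError(f"unexpected field count: {len(fields)}")

        label = float(fields[0])

        dense = []
        for value in fields[1:1 + NUM_DENSE]:
            # Assumption: missing values become 0, negatives are clamped to 0.
            x = int(value) if value else 0
            dense.append(math.log(max(x, 0) + 1))

        sparse = []
        for col, token in enumerate(fields[1 + NUM_DENSE:]):
            vocab = self.vocabs[col]
            # A new hex ID receives the next unused integer for its column;
            # a previously seen ID keeps the integer it was given.
            sparse.append(vocab.setdefault(token, len(vocab)))

        record = {"label": label, "dense": dense, "sparse": sparse}
        return json.dumps(record, separators=(",", ":"))


converter = CriteoToJsonl()
raw = "\t".join(["1"] + ["12"] + [""] * 12 + ["68fd1e64"] * 26)
print(converter.convert(raw))
```

Because the vocabularies live in the converter object, the same instance has to be reused across all lines (and all file splits) for the integer IDs to stay consistent.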

@laekov
Author

laekov commented Feb 19, 2024

> I am happy to share pre-processed data if it helps you.

Sure. That would be great!

It would also help if you could share your script for creating the JSONL from the npz or the raw dataset.

@iidsample
Collaborator

Okay, share your email and I can send you a link to download the data.

@laekov
Author

laekov commented Feb 19, 2024

My email is '[email protected]'

Thanks

@iidsample
Collaborator

I have shared the data file. Use the folder I shared with you in place of the processed CSV file.

@laekov
Author

laekov commented Feb 19, 2024

Got it. I will have a look. Thank you for your help.
