dump_bin DumpDataUpdate mode append data error #1818

Open
hbhuyt opened this issue Jun 27, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@hbhuyt

hbhuyt commented Jun 27, 2024

🐛 Bug Description

At first, I used dump_bin's DumpDataAll mode to import the data, and it worked fine.
Part of the imported data is as follows:
df[df['instrument']=='SH600306']
Out[35]:
instrument datetime $volume $factor $close
41691 SH600306 2024-04-23 1022018.0 0.281253 0.686257
41692 SH600306 2024-04-24 1372334.0 0.281253 0.652507
41693 SH600306 2024-04-25 951008.0 0.281253 0.618756
41694 SH600306 2024-04-26 1968818.0 0.281253 0.587818
41695 SH600306 2024-04-29 1532764.0 0.281253 0.559693

But when I append new data with DumpDataUpdate, the appended data is wrong.
The original (raw) data is as follows:
dfraw.loc[(dfraw['date']>'2024-04-29'),['instrument','date','close']]
Out[54]:
instrument date close
4356 SH600306 2024-05-29 0.098438
4357 SH600306 2024-05-30 0.092813
4358 SH600306 2024-05-31 0.101251
4359 SH600306 2024-06-03 0.092813
4360 SH600306 2024-06-04 0.095626
4361 SH600306 2024-06-05 0.092813
4362 SH600306 2024-06-06 0.092813
4363 SH600306 2024-06-07 0.095626
4364 SH600306 2024-06-11 0.090001
4365 SH600306 2024-06-12 0.090001
4366 SH600306 2024-06-13 0.087188
4367 SH600306 2024-06-14 0.081563

Part of the data after the import is shown below:

dfnew[dfnew.instrument=='SH600306']
Out[8]:
instrument datetime $volume $factor $close
10288 SH600306 2024-04-22 363992.0 0.281253 0.722820
10289 SH600306 2024-04-23 1022018.0 0.281253 0.686257
10290 SH600306 2024-04-24 1372334.0 0.281253 0.652507
10291 SH600306 2024-04-25 951008.0 0.281253 0.618756
10292 SH600306 2024-04-26 1968818.0 0.281253 0.587818
10293 SH600306 2024-04-29 1532764.0 0.281253 0.559693
10294 SH600306 2024-04-30 188390272.0 0.281253 0.098438
10295 SH600306 2024-05-06 117053368.0 0.281253 0.092813
10296 SH600306 2024-05-07 99965448.0 0.281253 0.101251
10297 SH600306 2024-05-08 85975896.0 0.281253 0.092813
10298 SH600306 2024-05-09 46003664.0 0.281253 0.095626
10299 SH600306 2024-05-10 61825620.0 0.281253 0.092813
10300 SH600306 2024-05-13 26138518.0 0.281253 0.092813
10301 SH600306 2024-05-14 19884768.0 0.281253 0.095626
10302 SH600306 2024-05-15 24197052.0 0.281253 0.090001
10303 SH600306 2024-05-16 12483558.0 0.281253 0.090001
10304 SH600306 2024-05-17 9390678.0 0.281253 0.087188
10305 SH600306 2024-05-20 27141916.0 0.281253 0.081563
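
Comparing the two tables row by row suggests that the appended close values are the raw ones, only attached to earlier dates. The following is a minimal sketch of that check, using the values copied from the tables above. The helper name `positional_shift` is made up for illustration and is not part of qlib.

```python
# Values copied from the raw data and the imported data shown above (SH600306).
raw_rows = [
    ("2024-05-29", 0.098438), ("2024-05-30", 0.092813), ("2024-05-31", 0.101251),
    ("2024-06-03", 0.092813), ("2024-06-04", 0.095626), ("2024-06-05", 0.092813),
    ("2024-06-06", 0.092813), ("2024-06-07", 0.095626), ("2024-06-11", 0.090001),
    ("2024-06-12", 0.090001), ("2024-06-13", 0.087188), ("2024-06-14", 0.081563),
]
dumped_rows = [
    ("2024-04-30", 0.098438), ("2024-05-06", 0.092813), ("2024-05-07", 0.101251),
    ("2024-05-08", 0.092813), ("2024-05-09", 0.095626), ("2024-05-10", 0.092813),
    ("2024-05-13", 0.092813), ("2024-05-14", 0.095626), ("2024-05-15", 0.090001),
    ("2024-05-16", 0.090001), ("2024-05-17", 0.087188), ("2024-05-20", 0.081563),
]


def positional_shift(raw, dumped, tol=1e-6):
    """Return (dumped_date, raw_date) pairs if the closes match row by row, else None.

    Hypothetical helper for this issue only.
    """
    if len(raw) != len(dumped):
        return None
    pairs = []
    for (raw_date, raw_close), (dump_date, dump_close) in zip(raw, dumped):
        if abs(raw_close - dump_close) > tol:
            return None
        pairs.append((dump_date, raw_date))
    return pairs


shift = positional_shift(raw_rows, dumped_rows)
for dump_date, raw_date in shift or []:
    print(f"stored under {dump_date}, but the raw row is dated {raw_date}")
```

Every close matches in order, so the values for 2024-05-29 onward appear to have been written directly after 2024-04-29, with the days in between skipped.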

I am trying to debug dump_bin.py to find the problem. I stepped through it up to this point, and the following code may be the cause.

    def _data_to_bin(self, df: pd.DataFrame, calendar_list: List[pd.Timestamp], features_dir: Path):
        if df.empty:
            logger.warning(f"{features_dir.name} data is None or empty")
            return
        if not calendar_list:
            logger.warning("calendar_list is empty")
            return
        # align index
        _df = self.data_merge_calendar(df, calendar_list)
        if _df.empty:
            logger.warning(f"{features_dir.name} data is not in calendars")
            return

When aligning the index, calendar_list does not contain dates such as 2024-05-06, but SH600306 has no data on these days.
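
One possible explanation, not confirmed against the dump_bin.py source: if the update appends the new values directly after the last stored value, each row is placed by position rather than by date, so a gap in the instrument's data shifts everything after it. The following is a minimal sketch of padding such a gap with NaN before appending. It assumes one stored value per calendar day; the helper `pad_gap_before_append`, the `date_col` parameter, and the shortened calendar are hypothetical and only for illustration.

```python
import pandas as pd


def pad_gap_before_append(new_df, calendar, last_stored_date, date_col="date"):
    """Reindex new rows onto every calendar day after last_stored_date.

    Hypothetical helper, not part of qlib. Calendar days without a row
    in new_df come back as NaN rows, so positions and dates stay aligned.
    """
    cal = pd.DatetimeIndex(pd.to_datetime(calendar))
    indexed = new_df.assign(
        **{date_col: pd.to_datetime(new_df[date_col])}
    ).set_index(date_col)
    last = pd.Timestamp(last_stored_date)
    # Keep calendar days after the last stored day, up to the newest row.
    wanted = cal[(cal > last) & (cal <= indexed.index.max())]
    return indexed.reindex(wanted)


# Shortened, illustrative calendar (assumption); closes are taken from the raw data above.
calendar = ["2024-04-26", "2024-04-29", "2024-04-30",
            "2024-05-06", "2024-05-29", "2024-05-30"]
new_df = pd.DataFrame({"date": ["2024-05-29", "2024-05-30"],
                       "close": [0.098438, 0.092813]})

padded = pad_gap_before_append(new_df, calendar, "2024-04-29")
print(padded)
```

With this input, the result has four rows: NaN for 2024-04-30 and 2024-05-06, followed by the two real closes on their own dates.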

@hbhuyt hbhuyt added the bug Something isn't working label Jun 27, 2024
@SunsetWolf
Collaborator

My guess is that your data is not normalized, which is causing this issue. I tried downloading the data with the command:
python scripts/get_data.py qlib_data --target_dir <user data dir> --region cn
After confirming that SH600306 exists in this data, I ran the command:
python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --end_date <end date>
The incremental update on the downloaded data did not produce the problem you described. I recommend using this method for incremental updates to the data.
