write_parquet followed by read_parquet may lead to a deadlock #19380
Comments
Can you provide a somewhat reproducible example of this happening?
The following code hangs:

```python
import torchdata.datapipes as dp
import polars as pl
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService


class DataSet(dp.iter.IterDataPipe):
    def __init__(self):
        super().__init__()
        df = pl.DataFrame(
            {
                'a': range(10),
                'b': range(10),
                'c': range(10),
            }
        )
        df.write_parquet("test.parquet")
        print(">>> Save parquet")

    def __iter__(self):
        df = pl.read_parquet("test.parquet")
        print(">>> Load parquet")
        for row in df.iter_rows():
            print(">>> Yield row")
            yield row


dataloader = DataLoader2(
    DataSet(),
    reading_service=MultiProcessingReadingService(num_workers=4)
)

for row in dataloader:
    print(row)
```

Outputs:
When the script is run again (the parquet file already exists), it behaves normally.

Outputs:

```
>>> Load parquet
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Load parquet
>>> Yield row
>>> Yield row
>>> Yield row
>>> Load parquet
>>> Load parquet
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
>>> Yield row
(0, 0, 0)
(0, 0, 0)
(0, 0, 0)
(0, 0, 0)
(1, 1, 1)
(1, 1, 1)
(1, 1, 1)
(1, 1, 1)
(2, 2, 2)
(2, 2, 2)
(2, 2, 2)
(2, 2, 2)
(3, 3, 3)
(3, 3, 3)
(3, 3, 3)
(3, 3, 3)
(4, 4, 4)
(4, 4, 4)
(4, 4, 4)
(4, 4, 4)
(5, 5, 5)
(5, 5, 5)
(5, 5, 5)
(5, 5, 5)
(6, 6, 6)
(6, 6, 6)
(6, 6, 6)
(6, 6, 6)
(7, 7, 7)
(7, 7, 7)
(7, 7, 7)
(7, 7, 7)
(8, 8, 8)
(8, 8, 8)
(8, 8, 8)
(8, 8, 8)
(9, 9, 9)
(9, 9, 9)
(9, 9, 9)
(9, 9, 9)
```
Besides, if num_workers is set to 0:

Outputs:
You are using multiprocessing. That is likely the source of your deadlocks.
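A minimal stdlib sketch of the usual workaround: select the "spawn" start method so workers are not fork()ed from the parent. (The `multiprocessing_context` keyword for torchdata's `MultiProcessingReadingService` mentioned in the comment is an assumption here; check the signature of your installed version.)

```python
import multiprocessing as mp

# A child created with "spawn" starts from a fresh interpreter, so it
# cannot inherit a mutex that a background thread pool in the parent
# (e.g. Polars' native thread pool) happened to hold at fork() time.
ctx = mp.get_context("spawn")

# Pass the context wherever a multiprocessing context is accepted,
# e.g. ctx.Pool(...) or ctx.Process(...); with torchdata one would
# instead pass a start-method option to the reading service
# (assumed: MultiProcessingReadingService(..., multiprocessing_context="spawn")).
print(type(ctx).__name__)  # prints "SpawnContext"
```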
Thanks! This explains why the deadlocks arise.
Somewhat differently, in my case the file lock is caused by … Thanks again!
Checks
Reproducible example
Log output
Issue description
I use polars to preprocess .csv data and write it into parquet files. However, I found that if I immediately read such a file (using read_parquet), the process hangs when multiprocess sampling (num_workers > 0) is used in the torch DataLoader.
Moreover, when I restart the script (with no need to reprocess the data), data loading works normally.
Hence, maybe something after write_parquet is not completely terminated? Thanks!
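The "not completely terminated" guess fits the fork-safety explanation: write_parquet returns, but Polars' native worker threads stay alive, and a later fork() clones any mutex they hold without cloning the threads that would release it. A hypothetical, Linux-only sketch for spotting such background threads (the helper names are illustrative, not a Polars or torch API):

```python
import multiprocessing as mp
import os


def live_os_threads() -> int:
    # Linux-specific: /proc/self/task has one entry per OS thread,
    # including native (non-Python) threads that threading.enumerate()
    # cannot see -- which is why this hazard is easy to miss.
    return len(os.listdir("/proc/self/task"))


def fork_looks_unsafe() -> bool:
    # Illustrative heuristic: forking while extra OS threads are alive is
    # risky unless the start method is "spawn" (or "forkserver").
    forky = mp.get_start_method(allow_none=True) in (None, "fork")
    return forky and live_os_threads() > 1


print("OS threads:", live_os_threads())
print("fork looks unsafe:", fork_looks_unsafe())
```

Running such a check right after write_parquet would show the extra native threads that a plain `threading`-based inspection misses.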
Expected behavior
None.
Installed versions