[Bug]: IngestionPipeline Cache does not work with num_workers > 1 #16605

Open
main-2983 opened this issue Oct 18, 2024 · 2 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@main-2983

Bug Description

When num_workers > 1, the persisted llama_cache file is empty. When num_workers=1, the IngestionPipeline caches normally.

Version

0.11.17

Steps to Reproduce

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding

if __name__ == '__main__':
    pipeline = IngestionPipeline(
        transformations=[
            FooNodeParser(),  # placeholder for a custom node parser
            OpenAIEmbedding(
                model="text-embedding-3-small"
            )
        ],
        vector_store=vector_store  # vector_store defined elsewhere
    )
    documents = SimpleDirectoryReader(path).load_data(
            show_progress=True, num_workers=4)
    nodes = pipeline.run(documents=documents, num_workers=4)
    # With num_workers=4 the persisted cache ends up empty;
    # with num_workers=1 it is populated as expected.
    pipeline.persist("cache")

Relevant Logs/Tracebacks

No response

@main-2983 main-2983 added bug Something isn't working triage Issue needs to be triaged/prioritized labels Oct 18, 2024

dosubot bot commented Oct 18, 2024

The issue you're experiencing with the IngestionPipeline cache not working when num_workers is greater than 1 might be related to how the cache is accessed and managed across multiple processes. When num_workers is set to more than 1, the pipeline uses a ProcessPoolExecutor to run transformations in parallel, and each process can access the cache independently if it is not disabled [1].

However, there may be specific conditions or configurations in your setup that lead to an empty cache file when using multiple workers; I couldn't find specific details on this in the repository. It's worth checking that the cache is correctly initialized and accessed in each worker process, and that it isn't being inadvertently disabled or cleared during execution. Note that with a ProcessPoolExecutor, each worker holds its own copy of in-memory state, so cache entries written in a child process are not automatically visible to the parent process that persists the cache file.

To continue talking to Dosu, mention @dosu.
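The process-isolation pitfall described above can be sketched with plain standard-library Python. This is a generic illustration of why parent-side in-memory state stays empty under a ProcessPoolExecutor; it is not llama-index's actual cache code, and the `cache` dict and `transform` function here are hypothetical stand-ins.

```python
from concurrent.futures import ProcessPoolExecutor

# A plain dict standing in for an in-memory cache (hypothetical;
# not llama-index's actual cache implementation).
cache = {}

def transform(x):
    # Each worker process operates on its own copy of `cache`; writes
    # happen in the child's memory and are never sent back to the parent.
    cache[x] = x * 2
    return x * 2

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(transform, range(4)))
    print(results)      # [0, 2, 4, 6] -- the work itself succeeds
    print(len(cache))   # 0 -- but the parent's cache was never populated
```

If the pipeline later persists the parent's cache object, it writes out exactly this kind of never-populated state, which would match the empty llama_cache file reported here.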

@logan-markewich
Collaborator

I need to remove this parameter; it doesn't work half the time and, tbh, is usually broken.

Set it to 1, or use async with pipeline.arun() to achieve concurrency, or multi-thread it outside of the class in your own code.
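The async suggestion can be sketched in the abstract with the standard library alone. Below, a stub coroutine stands in for `await pipeline.arun(documents=batch)` so the pattern is runnable without llama-index or an API key; the batching helper and `process_batch` name are illustrative, not part of any library API.

```python
import asyncio

async def process_batch(batch):
    # Stand-in for `await pipeline.arun(documents=batch)`:
    # simulate async I/O (e.g. embedding calls), then return "nodes".
    await asyncio.sleep(0)
    return [doc.upper() for doc in batch]

async def main(documents, batch_size=2):
    batches = [documents[i:i + batch_size]
               for i in range(0, len(documents), batch_size)]
    # Run all batches concurrently in one event loop. Everything stays in
    # a single process, so any in-memory cache the pipeline maintains is
    # updated where it will actually be persisted.
    results = await asyncio.gather(*(process_batch(b) for b in batches))
    return [node for batch in results for node in batch]

nodes = asyncio.run(main(["a", "b", "c", "d", "e"]))
print(nodes)  # ['A', 'B', 'C', 'D', 'E']
```

Because concurrency here comes from the event loop rather than worker processes, it avoids the cross-process cache isolation that num_workers > 1 introduces.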
