
Object/Blob store capabilities #467

Open
elijahbenizzy opened this issue Dec 11, 2024 · 0 comments

Comments

@elijahbenizzy
Contributor

Is your feature request related to a problem? Please describe.
It is common to hold a reference to some file/large object in state. This can be, for example:

  1. A PDF for RAG/ingestion
  2. An excel document that we're generating
  3. A large dataframe that was pulled from S3

The problem is that this can really crowd out the state if we JSON-serialize it. E.g. a large PDF could be MBs of base-64 data each time, and since state is currently saved in its entirety after every node execution, this compounds quickly.
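To make the overhead concrete, here's a quick standalone sketch (plain Python, no framework assumed) of what embedding a ~1 MB binary artifact in JSON state costs:

```python
import base64
import json
import os

# Simulate a ~1 MB binary artifact (e.g. a PDF) held in state.
blob = os.urandom(1_000_000)

# JSON can't hold raw bytes, so the blob is base-64 encoded first.
state = {"counter": 1, "document": base64.b64encode(blob).decode("ascii")}
serialized = json.dumps(state)

# base-64 alone adds ~33% overhead -- and this whole payload gets
# re-written after every node execution, since state is saved in full.
print(f"raw: {len(blob)} bytes, serialized state: {len(serialized)} bytes")
```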

Describe the solution you'd like

A few ideas, still thinking through:

Push to the serialization layer

This is the current approach -- we actually do this for pandas dataframes. To do this effectively, the user would define a CachingObject data type and register serialization capabilities for it. Pydantic could also do this easily.

This would look something like the following -- this is an immutable reference file (e.g. one that is passed in or generated externally). Note you'd have extra steps if you were generating a file.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceCachingFile:
    path: str
    contents_hash: Optional[str] = None
    contents: Optional[str] = None

    def get(self) -> str:
        if self.contents is None:
            self.contents = self.load()
        return self.contents

    def load(self) -> str:
        with open(self.path, "r") as f:
            contents = f.read()
        # TODO -- compute + verify hash, TBD how much verification we'd want
        return contents

@serde.serialize.register(ReferenceCachingFile)
def serialize_reference_caching_file(value: ReferenceCachingFile) -> dict:
    return {"path": value.path, "contents_hash": value.contents_hash, serde.KEY: "reference_caching_file"}


@serde.deserializer.register("reference_caching_file")
def deserialize_reference_caching_file(value: dict, **kwargs) -> ReferenceCachingFile:
    out = ReferenceCachingFile(**value)
    out.contents = out.load()
    return out

This is just an illustration -- it's missing two things:

  1. Generic pluggability to load from anywhere
  2. Saving capability as well in the case of generating files

Some options:

  1. Have mixins for saveable/writable with the right functions
  2. Keep these as part of the framework -- then just have people implement their own, with a few in plugins/subclasses
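Option 1 might look something like the following sketch -- note that the names here (LoadableMixin, WritableMixin, GeneratedCachingFile) are hypothetical, not an existing API:

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


class LoadableMixin:
    """Hypothetical mixin: anything with a path can be read back."""

    path: str

    def load(self) -> bytes:
        with open(self.path, "rb") as f:
            return f.read()


class WritableMixin:
    """Hypothetical mixin: adds save() for files we generate ourselves."""

    path: str

    def save(self, contents: bytes) -> str:
        with open(self.path, "wb") as f:
            f.write(contents)
        # Return a content hash so the framework could skip duplicate writes.
        return hashlib.sha256(contents).hexdigest()


@dataclass
class GeneratedCachingFile(WritableMixin, LoadableMixin):
    path: str
    contents_hash: Optional[str] = None
```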

Push to the persistence layer

I think there's room for an object_store abstraction here. Just like we have a persister, we could also have a blob store. E.g.

app = (
    Application()
    .with_state_persister(...)
    .with_object_store(FileSystemStore())
    ...
)

Then the user could specify some sort of CachingFile (for lack of a better name) like we had above:

import abc
from typing import Generic, TypeVar

T = TypeVar("T")


class CachingFile(abc.ABC, Generic[T]):
    contents: T
    contents_hash: str
    location_data: dict  # URI? TBD what extra data we need...

    @abc.abstractmethod
    def to_bytes(self) -> bytes:
        ...

    @abc.abstractmethod
    def from_bytes(self, data: bytes) -> T:
        ...

The blob store would then interact with this -- calling to_bytes/from_bytes and comparing the hash/saving at the hash location. Then the user can implement a CachingFile or use one of the plugins (pandas, pdf, etc...). The DB would store the serialized version, and delegate to the blob store when rehydrating or saving state.
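A minimal sketch of what FileSystemStore could look like (a hypothetical implementation, not the actual design): content-addressing by hash means saving identical bytes twice is a no-op.

```python
import hashlib
import os


class FileSystemStore:
    """Hypothetical blob store: content-addressed blobs on local disk."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, data: bytes) -> str:
        # Content-addressing: the hash *is* the key, so re-saving the
        # same bytes writes nothing new.
        key = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, key)
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(data)
        return key

    def load(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()
```

On save, the framework would call the CachingFile's to_bytes() and persist only the returned key in state; on rehydration it would load(key) and hand the bytes back to from_bytes().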

Some other things to flesh out:

  1. How to ensure we don't save it twice (the hash -- will need to put this in the interface...)
  2. When it would be loaded. Dynamically? On startup? Configurable?
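For question 2, dynamic loading could be a thin lazy handle (again a hypothetical sketch -- LazyBlob is not an existing class) that only hits the store on first access:

```python
class LazyBlob:
    """Hypothetical handle: defers the blob-store read until first access."""

    def __init__(self, store, key: str):
        self._store = store
        self._key = key
        self._contents = None

    @property
    def contents(self) -> bytes:
        if self._contents is None:  # fetch once, then keep in memory
            self._contents = self._store.load(self._key)
        return self._contents
```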

Describe alternatives you've considered
The first option above is easiest to do now -- this is how it works today, and we do something similar for pandas, though it's a bit messy.

Additional context
Milan J came to office hours today asking about this, thanks!
