
Object/Blob store capabilities #467

Open
elijahbenizzy opened this issue Dec 11, 2024 · 0 comments

Comments

@elijahbenizzy
Contributor

Is your feature request related to a problem? Please describe.
It is common to hold a reference to some file/large object in state. This can be, for example:

  1. A PDF for RAG/ingestion
  2. An excel document that we're generating
  3. A large dataframe that was pulled from S3

The problem is that this can really crowd out the state if we JSON-serialize it. E.g. a large PDF could be MBs of base-64 data each time, and since state is currently saved in its entirety after every node execution, this compounds quickly.
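To make the overhead concrete, here's a quick standalone sketch (plain Python, no framework assumed) of what embedding a ~1 MB binary artifact in JSON state costs:

```python
import base64
import json
import os

# Simulate a ~1 MB binary artifact (e.g. a PDF) held in state.
blob = os.urandom(1_000_000)

# JSON can't hold raw bytes, so the blob is base-64 encoded first.
state = {"counter": 1, "document": base64.b64encode(blob).decode("ascii")}
serialized = json.dumps(state)

# base-64 alone adds ~33% overhead -- and this whole payload gets
# re-written after every node execution, since state is saved in full.
print(f"raw: {len(blob)} bytes, serialized state: {len(serialized)} bytes")
```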

Describe the solution you'd like

A few ideas, still thinking through:

Push to the serialization layer

This is the current approach -- we actually do this for pandas dataframes. To do this effectively, the user would define a CachingObject data type and register serialization capabilities for it. Pydantic could also do this easily.

This would look something like the following -- this is an immutable reference file (e.g. one that is passed in or generated externally). Note you'd have extra steps if you were generating a file.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceCachingFile:
    path: str
    contents_hash: Optional[str] = None
    contents: Optional[str] = None

    def get(self) -> str:
        if self.contents is None:
            self.contents = self.load()
        return self.contents

    def load(self) -> str:
        with open(self.path, "r") as f:
            contents = f.read()
        # TODO -- compute + verify hash, TBD how much verification we'd want
        return contents

@serde.serialize.register(ReferenceCachingFile)
def serialize_reference_caching_file(value: ReferenceCachingFile) -> dict:
    return {"path": value.path, "contents_hash": value.contents_hash, serde.KEY: "reference_caching_file"}


@serde.deserializer.register("reference_caching_file")
def deserialize_reference_caching_file(value: dict, **kwargs) -> ReferenceCachingFile:
    out = ReferenceCachingFile(**value)
    out.contents = out.load()
    return out

This is just an illustration -- it's missing two things:

  1. Generic pluggability to load from anywhere
  2. Saving capability as well in the case of generating files

Some options:

  1. Have mixins for saveable/writable with the right functions
  2. Keep these as part of the framework -- then just have people implement their own, with a few in plugins/subclasses
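Option 1 might look something like the following sketch -- note that the names here (LoadableMixin, WritableMixin, GeneratedCachingFile) are hypothetical, not an existing API:

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


class LoadableMixin:
    """Hypothetical mixin: anything with a path can be read back."""

    path: str

    def load(self) -> bytes:
        with open(self.path, "rb") as f:
            return f.read()


class WritableMixin:
    """Hypothetical mixin: adds save() for files we generate ourselves."""

    path: str

    def save(self, contents: bytes) -> str:
        with open(self.path, "wb") as f:
            f.write(contents)
        # Return a content hash so the framework could skip duplicate writes.
        return hashlib.sha256(contents).hexdigest()


@dataclass
class GeneratedCachingFile(WritableMixin, LoadableMixin):
    path: str
    contents_hash: Optional[str] = None
```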

Push to the persistence layer

I think there's room for an object_store abstraction here. Just like we have a persister, we could also have a blob store. E.g.

app = (
    Application()
    .with_state_persister(...)
    .with_object_store(FileSystemStore())
    ...
)

Then the user could specify some sort of CachingFile (for lack of a better name) like we had above:

import abc
from typing import Generic, TypeVar

T = TypeVar("T")


class CachingFile(abc.ABC, Generic[T]):
    contents: T
    contents_hash: str
    location_data: dict  # URI? TBD what extra data we need...

    @abc.abstractmethod
    def to_bytes(self) -> bytes:
        ...

    @abc.abstractmethod
    def from_bytes(self, data: bytes) -> T:
        ...

The blob store would then interact with this -- calling to_bytes/from_bytes and comparing the hash/saving at the hash location. Then the user can implement a CachingFile or use one of the plugins (pandas, pdf, etc...). The DB would store the serialized version, and delegate to the blob store when rehydrating or saving state.
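A minimal sketch of what FileSystemStore could look like (a hypothetical implementation, not the actual design): content-addressing by hash means saving identical bytes twice is a no-op.

```python
import hashlib
import os


class FileSystemStore:
    """Hypothetical blob store: content-addressed blobs on local disk."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, data: bytes) -> str:
        # Content-addressing: the hash *is* the key, so re-saving the
        # same bytes writes nothing new.
        key = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, key)
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(data)
        return key

    def load(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()
```

On save, the framework would call the CachingFile's to_bytes() and persist only the returned key in state; on rehydration it would load(key) and hand the bytes back to from_bytes().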

Some other things to flesh out:

  1. How to ensure we don't save it twice (the hash -- will need to put this in the interface...)
  2. When it would be loaded. Dynamically? On startup? Configurable?
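For question 2, dynamic loading could be a thin lazy handle (again a hypothetical sketch -- LazyBlob is not an existing class) that only hits the store on first access:

```python
class LazyBlob:
    """Hypothetical handle: defers the blob-store read until first access."""

    def __init__(self, store, key: str):
        self._store = store
        self._key = key
        self._contents = None

    @property
    def contents(self) -> bytes:
        if self._contents is None:  # fetch once, then keep in memory
            self._contents = self._store.load(self._key)
        return self._contents
```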

Describe alternatives you've considered
The first option above is easiest to do now -- this is how it works today, and we do something similar for pandas, though it's a bit messy.

Additional context
Milan J came to office hours today asking about this, thanks!
