Is your feature request related to a problem? Please describe.
It is common to hold a reference to some file/large object in state. This can be, for example:
A PDF for RAG/ingestion
An Excel document that we're generating
A large dataframe that was pulled from S3
The problem is that this can really crowd out the state if we JSON-serialize it. E.g. a large PDF could be MBs of base-64 data each time, and as state is currently saved in its entirety after every node execution, this can get even worse.
Describe the solution you'd like
A few ideas, still thinking through:
Push to the serialization layer
This is the current approach -- we actually do this for pandas dataframes. To do this effectively, the user would define a CachingObject data type and register serialization capabilities for it. Pydantic could also handle this easily.
This would look something like this -- this is an immutable reference file (e.g. one that is passed in or generated externally). Note you'd have extra steps if you were generating a file.
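A rough sketch of what the serialization-layer approach could look like (all names here are hypothetical and purely illustrative, not an existing API): the custom serializer writes the bytes out of band, keyed by content hash, and only a small reference goes into state -- roughly what the pandas plugin does today.

import dataclasses
import hashlib
import pathlib

@dataclasses.dataclass(frozen=True)
class ImmutableFileReference:
    contents: bytes   # raw file contents -- immutable, so the hash is stable
    source_uri: str   # where it came from (path, S3 URI, ...)

    @property
    def contents_hash(self) -> str:
        return hashlib.sha256(self.contents).hexdigest()

def serialize_file_reference(ref: ImmutableFileReference, blob_dir: str) -> dict:
    # Write the bytes next to (not inside) the state, keyed by content hash,
    # and store only the small reference in state.
    path = pathlib.Path(blob_dir) / ref.contents_hash
    if not path.exists():
        path.write_bytes(ref.contents)
    return {"contents_hash": ref.contents_hash, "source_uri": ref.source_uri}

def deserialize_file_reference(data: dict, blob_dir: str) -> ImmutableFileReference:
    contents = (pathlib.Path(blob_dir) / data["contents_hash"]).read_bytes()
    return ImmutableFileReference(contents=contents, source_uri=data["source_uri"])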
Push to the persistence layer
I think there's room for an object_store abstraction here. Just like we have a persister, we could also have a blob store (sketched further down). Then the user could specify some sort of CachingFile (for lack of a better name) like we had above:
import abc
from typing import Generic, TypeVar

T = TypeVar("T")  # the in-memory object type (dataframe, parsed PDF, ...)

class CachingFile(abc.ABC, Generic[T]):
    contents: T
    contents_hash: str
    location_data: dict  # URI? TBD what extra data we need...

    @abc.abstractmethod
    def to_bytes(self) -> bytes:
        ...

    @abc.abstractmethod
    def from_bytes(self, data: bytes) -> T:
        ...
The blob store would then interact with this -- calling to_bytes/from_bytes and comparing the hash/saving at the hash location. Then the user can implement a CachingFile or use one of the plugins (pandas, PDF, etc.). The DB would store the serialized/deserialized version, and then delegate to the blob store when rehydrating or saving state.
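A minimal sketch of what that blob store could look like (hypothetical names; to_bytes and the content hash come from the CachingFile above):

import abc

class BlobStore(abc.ABC):
    """Stores large objects out of band, addressed by content hash."""

    @abc.abstractmethod
    def exists(self, contents_hash: str) -> bool:
        ...

    @abc.abstractmethod
    def put(self, contents_hash: str, data: bytes) -> None:
        """Write the bytes at the hash location (file path, bucket/key, ...)."""
        ...

    @abc.abstractmethod
    def get(self, location_data: dict) -> bytes:
        ...

def save_to_blob_store(store: BlobStore, obj: CachingFile) -> dict:
    # Only upload if this content hash hasn't been stored before -- one answer
    # to the "don't save it twice" question below.
    if not store.exists(obj.contents_hash):
        store.put(obj.contents_hash, obj.to_bytes())
    return {"contents_hash": obj.contents_hash}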
Some other things to flesh out:
How to ensure we don't save it twice (the hash -- will need to put this in the interface...)
When it would be loaded. Dynamically? On startup? Configurable?
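For the loading question, one option is a lazy handle that only pulls bytes from the blob store on first access (again purely a sketch, reusing the hypothetical BlobStore above):

from typing import Any, Callable, Optional

class LazyBlobHandle:
    """Defers fetching from the blob store until .value is first accessed."""

    def __init__(self, store: "BlobStore", location_data: dict,
                 from_bytes: Callable[[bytes], Any]):
        self._store = store
        self._location_data = location_data
        self._from_bytes = from_bytes
        self._cached: Optional[Any] = None

    @property
    def value(self) -> Any:
        # Fetch and deserialize on first access. An "on startup" policy would
        # instead call .value eagerly for every handle when state is rehydrated.
        if self._cached is None:
            self._cached = self._from_bytes(self._store.get(self._location_data))
        return self._cached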
Describe alternatives you've considered
The first one above is easiest to do now -- this is how it currently works, and we do something similar for pandas, though it's a bit messy.
Additional context
Milan J came to office hours today asking about this, thanks!