Caching robustness and determinism #3270

leventov · 2024-12-20T17:06:48Z

leventov
Dec 20, 2024

Hashing strategy for "execution path" refs could lead to inconsistent notebook state

If automatic cell re-run is off, and an upstream cell is non-deterministic, then upon re-running the upstream cell the notebook is in inconsistent state and if at this moment the notebook is closed, then upon restart there would be no way to know for Marimo runtime that the cache result for the downstream cell are invalid.

Although BlockHasher's doc mentions this (albeit this isn't even a part of user-facing docs!), the phrase "sources of non-determinism are not accounted for in this implementation, and are left to the user" is not helpful. The core of Marimo's caching logic is exactly the place to deal with this, at least in a good fraction of cases.

If the upstream cell is cached persistently (most likely because all cells are cached persistently, i.e., via #3054), then cached content hashing is vital for correctness and corruption prevention (see #3176). Then, if the upstream cell has been run in the current marimo notebook runtime (either forcefully, or because its own dependencies are invalidated), its results (i.e., "content") have also been serialised and cached, and thus the content hash for these latest results has also been recorded.

This means that we can include upstream execution path's content hash along with its module hash in the calculation of the block hash cheaply and at the same handle non-determinism much better.

File inputs

For even better handling of non-determinism, "file inputs" to cells should be tracked and also being part of the block hash (#3258). Thus, if the non-deterministic cell writes out different contents to these non-python files but has the same "content", the downstream cells would still record a cache miss and re-run.

dmadisetti · 2024-12-20T18:35:34Z

dmadisetti
Dec 20, 2024
Collaborator

For more determinism guarantees, see experimental strict mode: https://github.com/marimo-team/marimo/releases/tag/0.6.20

If automatic cell re-run is off, and an upstream cell is non-deterministic, then upon re-running the upstream cell the notebook is in inconsistent state and if at this moment the notebook is closed, then upon restart there would be no way to know for Marimo runtime that the cache result for the downstream cell are invalid.

Not really, because execution path is really the fallback.

cached content hashing is vital

Yep, that's the primary hashing check.

Caching does the best it can by relying on the contents of the variables it uses, and only falls back on execution mode if it is unable to hash the variable contents. It is possible to have an unhashable datatype that mutates between executions- but then your notebook is not deterministic regardless of persistent_cache.

As long as the variable data is hashable, then cache should be deterministic. If execution path is utilized- then as a limitation of python it is not possible to guarantee determinism, since file state and hidden memory can accumulate in ways that marimo cannot detect.

You can mitigate against this by using strict mode- which requires a deep copy between each cell execution, and also has some other scheduling guarantees.

But this of course is flawed in terms of performance.

There's potential for a COW (copy on write mode), that might be an intermediate here- but I think this might be a little tricky to properly implement.

albeit this isn't even a part of user-facing docs!

BlockHasher is more of an implementation detail than anything stable an end-user should utilize, but maybe a whitepaper is in order

0 replies

dmadisetti · 2024-12-20T18:41:09Z

dmadisetti
Dec 20, 2024
Collaborator

I do like your latter suggestion of tracking file contents.
I think that this would be a good feature when additional files are tracked.

0 replies

leventov · 2024-12-21T05:40:42Z

leventov
Dec 21, 2024
Author

Caching does the best it can by relying on the contents of the variables it uses, and only falls back on execution mode if it is unable to hash the variable contents. It is possible to have an unhashable datatype that mutates between executions- but then your notebook is not deterministic regardless of persistent_cache.

Could you please point to the place in code that does this?

Between these lines in BlockHasher.__init__():

marimo/marimo/_save/hash.py

Lines 357 to 386 in 1a1db27

    
           # Hold on to each ref type 
        
           self.content_refs = set(refs) 
        
           self.execution_refs = set(refs) 
        
           self.context_refs = set(refs) 
        
           # Default type, means that there are no references at all. 
        
           cache_type: CacheType = "Pure" 
        
           # TODO: Consider memoizing the serialized contents and hashed cells, 
        
           # such that a parent cell's BlockHasher can be used to speed up the 
        
           # hashing of child. 
        
           # Collect references that will be utilized for a content hash. 
        
           content_serialization: dict[Name, bytes] = {} 
        
           if refs: 
        
               cache_type = "ContentAddressed" 
        
               refs, content_serialization, stateful_refs = ( 
        
                   self.collect_for_content_hash( 
        
                       refs, scope, ctx, scoped_refs, apply_hash=False 
        
                   ) 
        
               ) 
        
               self.stateful_refs |= stateful_refs 
        
           self.content_refs -= refs 
        
           # If there are still unaccounted for references, then fallback on 
        
           # execution hashing. 
        
           if refs: 
        
               cache_type = "ExecutionPath" 
        
               refs = self.hash_and_dequeue_execution_refs(refs) 
        
           self.execution_refs -= refs | self.content_refs

And collect_for_content_hash() impl:

marimo/marimo/_save/hash.py

Lines 453 to 482 in 1a1db27

    
           def collect_for_content_hash( 
        
               self, 
        
               refs: set[Name], 
        
               scope: dict[str, Any], 
        
               ctx: Optional[RuntimeContext], 
        
               scoped_refs: set[Name], 
        
               apply_hash: bool = True, 
        
           ) -> SerialRefs: 
        
               self._hash = None 
        
               refs, content_serialization, _ = ( 
        
                   self.serialize_and_dequeue_content_refs(refs, scope) 
        
               ) 
        
               # If scoped refs are present, then they are unhashable 
        
               # and we should fallback to normal hash or fail. 
        
               if unhashable := (refs & scoped_refs) - self.execution_refs: 
        
                   # pickle is a python default 
        
                   import pickle 
        
                   failed = [] 
        
                   exceptions = [] 
        
                   # By rights, could just fail here - but this final attempt should 
        
                   # provide better user experience. 
        
                   for ref in unhashable: 
        
                       try: 
        
                           _hashed = pickle.dumps(scope[ref]) 
        
                           content_serialization[ref] = type_sign(_hashed, "pickle") 
        
                           refs.remove(ref) 
        
                       except (pickle.PicklingError, TypeError) as e: 
        
                           exceptions.append(e) 
        
                           failed.append(ref)

I don't see how what you are saying takes place. Moreover, self.execution_refs are used inside collect_for_content_hash() impl essentially before this set is "fully" initialised in __init__, namely only after self.execution_refs = set(refs). Isn't this a bug?

0 replies

leventov · 2024-12-21T05:48:48Z

leventov
Dec 21, 2024
Author

It is possible to have an unhashable datatype that mutates between executions- but then your notebook is not deterministic regardless of persistent_cache.

A remark here is that even if the notebook is knowingly non-deterministic, e.g., it involves LLM generations with non-zero temperature that are used as inputs in downstream cells, it's still good to definitely know at the marimo level (and figure-outable across sudden runtime shutdowns, restarts, etc.) whether these non-deterministic cells are consistent with each other, i.e., whether they "resulted (or could have resulted) from a blank-slate notebook run top-to-bottom, without using cache".

0 replies

leventov · 2024-12-21T06:04:13Z

leventov
Dec 21, 2024
Author

`primitives.is_pure_function`

Returns primitives.is_pure_function returns True on calling with builtin open() (for opening files).

0 replies

dmadisetti · 2024-12-21T09:43:23Z

dmadisetti
Dec 21, 2024
Collaborator

Could you please point to the place in code that does this?

Just a little further down:

marimo/marimo/_save/hash.py

Line 399 in 1a1db27

if apply_content_hash:

You can refer to the unit tests for edge cases if you are concerned: https://github.com/marimo-team/marimo/blob/1a1db277543da001fe8f1b0c84b7a9632c4645bd/tests/_save/test_hash.py

Re the proposed bug IIRC, execution_refs in collect_for_content_hash are for @cache and that conditional is meant to be a no op otherwise (@cache has a bit more aggressive cache invalidation)

Excited if you can think of a non-deterministic case! We did spend a fair bit of time thinking through edge cases

it's still good to definitely know at the marimo level

The hash mechanism is denoted as a prefix to the hash (e.g. E_ C_), and the "hash type" information is available on the resultant cache object. Did you have a different idea here?

primitives.is_pure_function returns True on calling with builtin open() (for opening files)

That is a good observation, I believe that is_pure_function also blindly considers library calls as pure as well. I think #3258 might be the best way forward here.

I'll read over the docs again, but they shouldn't sound like caching is totally bullet proof. I'm fairly certain we call out randomness, network-requests and file system concerns

0 replies

leventov · 2024-12-21T12:15:57Z

leventov
Dec 21, 2024
Author

Re the proposed bug IIRC, execution_refs in collect_for_content_hash are for @cache and that conditional is meant to be a no op otherwise (@cache has a bit more aggressive cache invalidation)

How it's not a certainty that the if in line 467 is never taken?

execution_refs equals refs in line 362.
In line 462, refs only get shrinked (or remain the same)
Then, (refs & scoped_refs) - self.execution_refs - shrink it even further (find intersection with scoped_refs) and then remove execution_refs (still equal to the original refs) - the result must be empty.

`serialize_and_dequeue_content_refs`

The hash mechanism is denoted as a prefix to the hash (e.g. E_ C_), and the "hash type" information is available on the resultant cache object. Did you have a different idea here?

I don't see how serialize_and_dequeue_content_refs handles any case where the variable is a simple data class that I've defined. For example:

@dataclass
class A:
    text: str

@app.cell
def one():
    a = A(text=llm.prompt("blah", temperature=1))

@app.cell
def two():
    _ = llm.prompt("do something with " + a.text)

I don't see how the current code prevents the above app from falling in the trap that I described in the original message of this issue: cell "one" is re-run (and its persistent_cache is updated), then app crashes / laptop powers off / whatever, then marimo reloads the notebook and it will happily load the persistent cache for cell "two" without indication that it's inconsistent with cell "one"'s output.

`is_pure_function`

That is a good observation, I believe that is_pure_function also blindly considers library calls as pure as well. I think #3258 might be the best way forward here.

Ok, I'm now not sure at all what's the purpose of that is_pure_function call inside serialize_and_dequeue_content_refs(). Functions with side effects such as open are tolerated, and in fact that's often the point of caching to skip exactly such blocks that do something heavy and then write out those heavy results to disk, database, etc. But non-deterministic functions should also be tolerated because again, when I want a cell to be cached I'm pretty clear that I don't want to immediately make it moot by considering all reads of the current time, random generator calls, etc. to invalidate the cache (that is, always invalidate). Clearly, in the vast majority of cases, the user's intent will be to only invalidate the cache for the cell upon changes of explicit inputs from other cells, not the things that change all the time in the background.

So, I just don't understand what that call is supposed to "catch" there. (Let alone, as we found out, is_pure_function is a misnomer; so what is_pure_function actually does is also very unclear.)

0 replies

dmadisetti · 2024-12-21T14:53:20Z

dmadisetti
Dec 21, 2024
Collaborator

is_pure_function is meant to determine if on the surface level mutations can occur within a function.
The "point" of is_pure_function is reduce the need of relying on execution_cell hashes, by making certain functions content addressable. Maybe is_pure_enough_function would be a better name.

Re your simple example: Cell 2 will not run unless Cell 1 has, so I am unsure of the issue. In normal marimo, Cell 2 can be run manually if Cell 1 has failed (note a cache mechanism failure results in rerun)- but this behavior can be disabled with the before mentioned strict mode. Your example can also be written in a way such that there is no issue:

@dataclass
class NotContentAddressable:
    text: str

@app.cell
def one():
    with mo.persistent_cache("llm_call") as cache:
        a = NotContentAddressable(text=llm.prompt("blah", temperature=1))

@app.cell
def two():
    text = a.text
    with mo.persistent_cache("llm_feed") as cache:
        _ = llm.prompt("do something with " + text) # Will now be content hash vs execution hash since text is 'primitive'
    assert cache.cache_type == "ContentAddressed"

If your LLM is non-deterministic, then maybe consider utilizing a seed. Caching and non-determinism do not mix well. From your feature specs it seems like you would like something more like an archive with quick retrieval, which I am not certain falls under the current scope of the caching feature.

Re robustness- I appreciate your concern, but I do think it's a little misplaced unless you can provide concrete examples of where the hashing mechanism fails barring cases with side effects.

I'm moving this over to "discussion" since it seems that better documentation/ communication is required for caching- but I just made an issue for incorporating tracked files into cache (#3271).

But feel free to keep the discussion going !

4 replies

leventov Dec 21, 2024
Author

Your example can also be written in a way such that there is no issue

Yes, it can. But why there should be difference in behaviour with/without the text = a.text line, from the perspective of the user? It seems highly arbitrary. It violates the principle of least astonishment. And it just doesn't feel right that the framework requires the users to perform these "pointless" rituals in order to make the code work as they would expect.

And I also don't see any strong justification for keeping things this way.

As I outlined in the very beginning of this discussion, if all cell results are cached (or at least that upstream cell 1's results are cached), we already have the hash for that a thing. And so we can plug this hash in (I went into more details about how this plugging-in could be done here: it would probably require a weak identity dict, but it should be possible).

So, in fact, apart from ironing out this arbitrariness, the change would also bring the optimisation that is outlined in the code already:

marimo/marimo/_save/hash.py

Lines 365 to 367 in 112190c

    
           # TODO: Consider memoizing the serialized contents and hashed cells, 
        
           # such that a parent cell's BlockHasher can be used to speed up the 
        
           # hashing of child.

leventov Dec 21, 2024
Author

if all cell results are cached (or at least that upstream cell 1's results are cached), we already have the hash for that a thing.

In fact, the issue that I've described could manifest only when both cells are persistently cached. So kind of solving it only for these cases (i.e., when both parent & child are persisted) would also be sufficient, in the minimal version of the proposed change.

dmadisetti Dec 21, 2024
Collaborator

But why there should be difference in behavior with/without the text = a.text line, from the perspective of the user?

From the perspective of the user, there isn't - because execution hash is in general sufficient and an under the hood detail.

The example you posted could never happen since a wouldn't be defined, and would bust the cache since the value of a being now undefined, invalidates the second cache

I'm receptive to a screenshot or notebook of where this inconsistent state is created. Here's your example in a notebook https://marimo.app/l/jiczts

leventov Dec 23, 2024
Author

I cannot reproduce this scenario right now because there is no way to "forcefully re-run" the persistently cached cell (or, if the cell is not itself persistently cached, as it is not even currently possible, but has persistent_cache blocks -- in this case, this action would work as "re-run the cell, ignoring any nested persistent_cache() blocks' cached results - run it as if the cell was run for the very first time".

This action can only be approximated now by changing the code of the cell (even if dummily, like adding extra line like foo = 1 to the persistently cached block), and well, currently this would also make downstream cells query non-yet-existent module_hash as the cache key -- thus, no consistency issue, all good.

However, in the land of #3054 the action "force re-run the cell without changing its code" must definitely be present (equivalently, it's already badly needed if someone sets out to use persistent_cache() profusely, because constant fiddling with blocks' source is inconvenient and wouldn't scale, interfere with code version control, etc., and invalidating ALL cache is not a good option).

So, if that action was present, and with the following notebook:

import marimo

__generated_with = "0.10.6"
app = marimo.App(width="medium")

@app.cell
def cell_1():
    import dataclasses
    @dataclasses.dataclass
    class A:
        text: str
    return A, dataclasses


@app.cell
def cell_2():
    import random as r
    import time
    return r, time


@app.cell
def cell_3(r):
    def llm(prompt, temperature=None):
        return "LLM:" + str(r.randint(1, 100))
    return (llm,)


@app.cell
def cell_4():
    import marimo as mo
    return (mo,)


@app.cell
def cell_5(A, llm, mo, time):
    with mo.persistent_cache("llm_call") as _cache:
        a = A(text=llm("blah", temperature=1))
    time.sleep(10)
    print(a.text)
    return (a,)


@app.cell
def cell_6(a, mo):
    with mo.persistent_cache("llm_call2") as _cache:
        out = "do something with " + a.text
    out
    return (out,)

If the following sequence of actions was performed, exactly:

I've opened the notebook for the first time on my local machine with marimo edit and run it top-to-bottom.
I've pressed the button "forcefully re-run" on cell_5 (maybe, I didn't like the output LLM gave me the first time).
When cell_5 was still running, at the time.sleep(10) line (emulating a heavy or remote computation step; it would be even more realistic if that heavy computation was a separate cell, turning a into a2 with a different text content, but you get the idea), I run kill -9 <marimo edit pid> from my command line.
I re-open the notebook with marimo edit

(Note: it appears that I don't even need "disabled cells" for this, although with them, the time frame for kill -9 would become arbitrary long.)

So, if this entire sequence of actions was performed, when I re-open the notebook,
My claim:

marimo would load the cached results of all cells immediately, 100% cache hit rate, no re-execution, and no indication that the cached result for cell_6 (or block within it, which makes no difference) is inconsistent with the output of cell_5.
Your claim: marimo would re-run the cached block in cell_6?

My understanding was that we were previously on the same page about this, which I took from your earlier phrase "your example can also be written in a way such that there is no issue", from which I understood as your recognition text = a.text line does make semantic difference, and there is some kind of "issue", and we were more like arguing about the importance of it, or the justification for fixing it.

dmadisetti · 2024-12-23T08:26:22Z

dmadisetti
Dec 23, 2024
Collaborator

As is, there's no forceful rerun and no persistent cache execution mode, so this case is a hypothetical- but I hear you. I think this is a good design consideration for #3054

I think caching and non-determinism do not mix well- and an easy cache invalidation is a reason for the required cache "naming" mechanism. I think the best way to implement your type of "forced rerun" would be changing a seed value, which would invalidate the cell and downstream values. Or, in the case where you cannot control the seed- you can add a slider and reference it in the cell (for now). This UI/ api could be improved specifically for cache invalidation (and in your case LLM prompts)

and be deterministic. Reproducibility is a key focus of the caching mechanism, and a "forced" rerun is best done via cache miss

6 replies

leventov Dec 23, 2024
Author

Re: determinism: OpenAI and other big providers already don't guarantee determinism even if you provide the seed parameter:

seed | integer or null | Optional
This feature is in Beta. If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.

And going forward, we should probably expect even less determinism in general. Partially perhaps because making GPUs cheaper and more energy efficient requires making them less exactingly precise, partially perhaps it could be some product decisions such as for preventing exactly-reproducible model jailbreaks or something.

Re: UI fix: I don't understand yet what you have proposed, but as a general note, I think UI shouldn't be on the critical path to solving issues like this because of legitimate "headless" use cases, e.g., when AIs themselves manipulate these notebooks as code, and rely on Marimo infra (including caching) for running the notebook, but not for displaying it.

Re: opt-in: I agree there are gotchas with pickling. Thus, definitely, bold warnings for persistent_cache and the doc for #3054 are required. However, it seems to me an "even further" concern than what we are discussing here to improve the consistency/robustness/reporducibility beyond fixing the issue we have been discussing up to this point on this page? Or am I missing something?

This is because, the failure scenario as I see it that you describe:

cell_5 creates an object with a tricky state that gets lost over pickle-unpickle rounds.
cell_6's behaviour depends on those tricky "potentially lost via pickling" parts of the state.
Both cells are cached. cell_5 re-run (and cache is not updated because pickling output didn't change), then notebook killed without cell_6 re-running.
We load the notebook again, and may see "inconsistency without indication".

However, with the current behavior of caching, that would be exactly the same problem. The user can only avoid it only by unearthing that "tricky inner part of the object state that is missed by pickle" by storing it as a cell-level "exported" variable. So, this solution, requires explicit knowledge by the user about this specific gotcha: it's very much like this solution is "opt in" itself.

Second solution, well, is to not cache these tricky cell results or blocks at all. This solution doesn't seem to me to interact with the issue we are discussing here: it's, like, general precaution about caching. As I remarked above, we should document that.

The third alternative is to "soft invalidate" downstream cells, for example, via keeping the list of downstream invalidated module_hashes along with cell cache results (such as, in the case above, cell_5 would store the module_hash of cell_6 that depends on it) even if the "force cell re-run" didn't update the pickled results.

If we just do this (store invalidated lists of downstream module_hashes), in files, rather than purely in marimo's app memory, this solution alone would also solve the issue we are discussing all along here, but it has its own issues:

When we are running the very first cell for the first time, we cannot yet compute module_hashes for all downstream cells, so we cannot yet store the aforementioned list of dependent cells. We can only do this when we run all the cells top-to-bottom. But, this itself creates new possible race situations. These race situations could still be addressed by storing "sentinel" lists-of-invalidated-module-hashes, and then updating them when the notebook is run top-to-bottom in full. This would also require these "lists" to be separate files, because otherwise we would "trash" potentially large results objects too much ("trash" from Git's point of view, as per Persistent cache with auto-cleanup #3176).
As you can see, complexity snowballs quickly, as well as potential extra performance implications (extra files to write, perhaps, extra files to read).
The "all or nothing" approach to storing the list of invalidated module_hashes means worse UX: it would to display the entire list of cells as "potentially need a re-run" until I did re-run all cells. But what if there is some very expensive cell below that I actually don't want to re-run (i.e., it would have been "opted in" for the "rely on pickle only"; or in the absence of that option, just use the current module_hash, which would also be possibly valid)? I would be fine with that symbol "please re-run the cell" for that last cell hanging arbitrarily long, but not for all cells above. To avoid that for all cells above, the "lists of possibly invalidated module hashes" for all upstream cells would need to be updated when each cell is re-run, ~~resulting in quadratic avalanche of read/write operations with files storing those invalidated module_hashes~~, as well as may interfere with concurrent/async notebook/cell execution (future features).
The above point can be remediated if instead of storing file for each cell recording invalidated downstream module_hashes, instead, each cell is associated with a "code only" hash (basically, the hash of all code of upstream cells and all modules as if they all were "pure" in BlockHasher parlance) and store the flag "potentially invalid" just for these hashes, one per cell. But this solution itself may be too aggressive with invalidation (for the same reason we care about content addressing in the current caching subsystem at all); it will disrupt possible future features like "I want to see how multiple runs of this notebook have unfolded in parallel, considering non-determinism, but without code updates".
What about cells that are attached to the same app but not known to this marimo instance, e.g., residing in a separate .py file? This is itself a source of bugs, because it would mean this "store lists" solution would still not guarantee perfect pickle-gotcha-avoidance, but may impart false confidence in the users.

So, I'm not sure that third alternative is the "cure" we want by default.

However, if we do make module_hashes on upstream hashes of cached (pickled) results, as I propose here all along, and make the alternative "store lists of invalidated module_hashes" inside of marimo runtime only (in a weak ref identity map or whatever, as usual), we will get the best of both worlds: less runtime penalty, less files to write, less cache invalidation, addressing the TODO that was already in BlockHasher code, ~~and get as much pickle-gotcha-resilience as we could possibly have~~.

The path to implementing that "best" option lies through first doing what I proposed and then adding that runtime set of invalidated module_hashes. Or, well, these could be part of the same PR/change set.

leventov Dec 23, 2024
Author

Correction: content-hashes + "list of invalid module_hashes in runtime" is not "as much pickle-gotcha-resilient as it could possibly get". "List of invalid module_hashes in files disk" would be still more resilient, even though with the associated extra complexity and performance penalty.

Content-hashes + list of invalid "module hashes incorporating those upstream content hashes" on disk would still be more ergonomic, and forward-looking, e.g., with features like "run the notebook multiple times and be able to select a run-down to observe through the UI". So, modulo the implementation complexity and possibly more race conditions involving marimo runtime killings, multiple runtimes working in parallel, etc., that I haven't thought through yet, that would be the most attractive combination for the user, probably.

leventov Dec 23, 2024
Author

Another correction: there is no need to store the "list of downstream invalidated module_hashes" for each upstream cell; There should be just a single such list per notebook (or, if you wish, multiple lists, one per completely independent DAG subtree as the notebook is running top to bottom; but without loss of generality, this could be single list. Multiple lists could only help slightly with concurrency). Still, this list may need to be updated (on disk) on each cell re-run, because the list of those module hashes reveals itself computationally through the notebook run, as content addressible (whether pickled or not) variable values become known.

This list is known ahead of time only if all cells are cached, as per #3054 or manually. Note that potential gotchas in pickle that may lead to divergence even in "pickle-observable" results down the road doesn't affect this, because then the module_hashes for those (cached) downstream cells would be something "never seen before", and hence, that list would just contain irrelevant module_hashes, which is benign.

The craziest edge case if that new module_hashes were seen before (such as via the other run-down of the notebook), giving "fork-and-merge" cell result pathways; but since this may only result from multiple marimo run-downs of the notebook (because notebooks are DAGs), it's not well-defined to declare that module_hash potentially invalid on the disk: that module_hash's cached result may have been produced by a perfectly benign run-down without any "picklings that lose information in between". So, those module_hashes should still be indicated as "requiring re-run" but only within marimo runtime.

dmadisetti Dec 23, 2024
Collaborator

I'll mock up what I have in mind for the UI element. As is, marimo lacks a good means of viewing "history", and I think some of your desire to heavily rely on cache comes from this.
I think it's a good addition in general, and can also be repurposed for other "non-deterministic runs".
I think I can actually use it in a project I'm working on now, so no time lost.

Re: UI fix:

This is something where there's been a bit of thought in terms of making UI not just confined to the interface, but some how expose that to scripts. It's frequently desirable to run a notebook with all possible enumerations of inputs (think papermill).

Re: opt-in:

You go from bugs being determinable and annoying- to bugs being untraceable and library specific.
That seems like a less ideal tradeoff to me, but maybe uncommon enough that it's fine.

I think you're getting a little in the weeds with the implementation details. There's no need for invalidating cache hashes if everything is done correctly. Here's the actionables I see from this conversation:

Handle transitive effects of UI / State when there are multiple cache blocks (bug)
Allow for content pickling in hash (feature)
(for discussion) Creating an interface for cache-busting and recording the history of non-deterministic cells (ui feature)

leventov Dec 23, 2024
Author

It makes sense to sequence 2.) after #3176 because content hashing will anyway require rework of the current persistent cache storage design, at least if in this version it would be just adding content hash suffix to the file names (i.e., <prefix>_<module_hash>_<content_hash>.pickle). Instead of introducing this transitionary persistent cache arrangement, moving directly to #3176 would be already a more forward-looking base. The 2.) feature also doesn't seem that urgent to require getting it out ASAP, so I think it can wait on #3176. Therefore, regarding this line of work (independent of 1.) above), I would suggest to move the main line of the discussion to #3176.

Caching robustness and determinism #3270

leventov Dec 20, 2024

Hashing strategy for "execution path" refs could lead to inconsistent notebook state

File inputs

Replies: 9 comments · 10 replies

dmadisetti Dec 20, 2024 Collaborator

dmadisetti Dec 20, 2024 Collaborator

leventov Dec 21, 2024 Author

leventov Dec 21, 2024 Author

leventov Dec 21, 2024 Author

primitives.is_pure_function

dmadisetti Dec 21, 2024 Collaborator

leventov Dec 21, 2024 Author

serialize_and_dequeue_content_refs

is_pure_function

dmadisetti Dec 21, 2024 Collaborator

leventov Dec 21, 2024 Author

leventov Dec 21, 2024 Author

dmadisetti Dec 21, 2024 Collaborator

leventov Dec 23, 2024 Author

dmadisetti Dec 23, 2024 Collaborator

leventov Dec 23, 2024 Author

leventov Dec 23, 2024 Author

leventov Dec 23, 2024 Author

dmadisetti Dec 23, 2024 Collaborator

leventov Dec 23, 2024 Author

leventov
Dec 20, 2024

Replies: 9 comments 10 replies

dmadisetti
Dec 20, 2024
Collaborator

dmadisetti
Dec 20, 2024
Collaborator

leventov
Dec 21, 2024
Author

leventov
Dec 21, 2024
Author

leventov
Dec 21, 2024
Author

`primitives.is_pure_function`

dmadisetti
Dec 21, 2024
Collaborator

leventov
Dec 21, 2024
Author

`serialize_and_dequeue_content_refs`

`is_pure_function`

dmadisetti
Dec 21, 2024
Collaborator

leventov Dec 21, 2024
Author

leventov Dec 21, 2024
Author

dmadisetti Dec 21, 2024
Collaborator

leventov Dec 23, 2024
Author

dmadisetti
Dec 23, 2024
Collaborator

leventov Dec 23, 2024
Author

leventov Dec 23, 2024
Author

leventov Dec 23, 2024
Author

dmadisetti Dec 23, 2024
Collaborator

leventov Dec 23, 2024
Author