Dynamic execution for solving equations #6017

ifethereal · 2021-12-19T09:41:42Z

Hi all! I am exploring Dagster as a proof-of-concept for some scientific computing. I have some working example code below, and I am wondering whether it can be extended to meet my use case.

My use case is that I am given a (predetermined) set of equations in several variables and I want to solve for the unknown variables given particular values of known variables, allowing for any combination of variables to be the known ones.

For Dagster, I imagine this corresponds vaguely to a question of whether it's possible to dynamically (and repeatedly/recursively) alter the execution flow. The first op to run would depend on (I imagine) the op execution context, and each op to run after that would depend on the outputs from the previous ops as well as the op execution context.

Assumptions:

Given any combination of known variables, there is always some DAG of other Dagster ops that solves for all the unknown variables. If there are multiple such DAGs then they produce the same overall output.
There is no way to know in advance which variables would be the known ones.
The data for known variables, provided they sit inside some contiguous region, mathematically result in exactly 1 solution for the specific equations.

My code below demonstrates, for a particular set of equations (in 4 variables—see code comment below), how I would attempt this problem if the question of which variables are known had a predetermined answer (thus falling short of satisfying my use case). Separately, given values for known variables, there is always some DAG consisting entirely of resolve_xx() ops that can solve for the values of the unknown variables. Note that the resolve_xx() ops are (manually) vectorised in this case.

As a side note, this feature is available in (and seemingly the hallmark feature of) a different Python package schedula, so I was wondering whether Dagster was capable of something similar.

from math import log, exp

from dagster import op, job
from typing import List


"""
    A / B = C
    C = exp(D)
"""


@op
def resolve_00(A: List[float], B: List[float]) -> List[float]:
    return [a / b for (a, b) in zip(A, B)] # C


@op
def resolve_01(C: List[float], B: List[float]) -> List[float]:
    return [c * b for (c, b) in zip(C, B)] # A


@op
def resolve_02(C: List[float], A: List[float]) -> List[float]:
    return [a / c for (c, a) in zip(C, A)] # B


@op
def resolve_10(D: List[float]) -> List[float]:
    return list(map(exp, D)) # C


@op
def resolve_11(C: List[float]) -> List[float]:
    return list(map(log, C)) # D


@op
def start_A() -> List[float]:
    return [2., 4., 6.]


@op
def start_B() -> List[float]:
    return [1., 2., 3.]


@op
def display(context, l):
    context.log.info(f"Resolved to {l}.")


@job
def main():
    A = start_A()
    B = start_B()

    C = resolve_00(A, B)
    D = resolve_11(C)

    display(C)


if __name__ == "__main__":
    result = main.execute_in_process(
        run_config={
            "loggers": {
                "console": {"config": {"log_level": "INFO"}}
            }
        }
    )
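For comparison, the target behaviour (any combination of known variables) can be sketched outside Dagster as a simple forward-chaining loop. This is only an illustration of the use case, not a Dagster solution: the `RULES` table and the `solve` function are hypothetical names, and the sketch works on scalars rather than the vectorised lists used above.

```python
from math import exp, log
from typing import Callable, Dict, List, Tuple

# Each rule: (input variable names, output variable name, function).
# The rules mirror the resolve_xx() ops for A / B = C and C = exp(D).
Rule = Tuple[Tuple[str, ...], str, Callable[..., float]]

RULES: List[Rule] = [
    (("A", "B"), "C", lambda a, b: a / b),
    (("C", "B"), "A", lambda c, b: c * b),
    (("C", "A"), "B", lambda c, a: a / c),
    (("D",), "C", lambda d: exp(d)),
    (("C",), "D", lambda c: log(c)),
]


def solve(known: Dict[str, float]) -> Dict[str, float]:
    """Apply rules repeatedly until no rule can produce a new variable."""
    values = dict(known)
    progress = True
    while progress:
        progress = False
        for inputs, output, fn in RULES:
            # Fire a rule only if its output is still unknown and all inputs are known.
            if output not in values and all(name in values for name in inputs):
                values[output] = fn(*(values[name] for name in inputs))
                progress = True
    return values


# Knowing B and D is enough: D gives C, then C and B give A.
print(solve({"B": 2.0, "D": 0.0}))
```

The order in which rules fire depends on which variables are known, which is exactly the part that a static DAG cannot express directly.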

OwenKephart · 2021-12-20T23:25:58Z

OwenKephart
Dec 20, 2021
Maintainer

This should actually be possible (but potentially quite difficult) using conditional branching. Dagster requires that the structure of the DAG be static through the duration of a given run, but the subset of a DAG that gets executed can be determined dynamically at runtime. So conceptually, if you define a DAG that contains all necessary sub-DAGs, then you've sort of solved your problem (if you can determine which sub-DAG you need at runtime). Your graph could start off with an output for each of your variables A, B, C, D. These Outs could be marked with is_required=False, and only fired if they are available for a given run. Downstream ops consuming these outputs would only execute if each of their inputs were fired.

However, in practice I think it would be quite difficult to write such a DAG in a way that guaranteed that you wouldn't redo work you had already done. The graph would need multiple copies of each op, and get pretty complex if you had an appreciable number of variables you wanted to handle.

Another option would be to move more of the control flow logic into the body of the op rather than have Dagster manage all of it. The main benefit is that this code would likely be significantly easier to write (at least from my perspective). At the extreme end you would put all of the necessary logic inside a single op, but you could also make different tradeoffs. For example, you could have a graph that is just a bunch of copies of this op chained together:

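from dagster import Out, Output, op

# D_to_C, C_to_D, BC_to_A, AC_to_B and AB_to_C stand in for the individual
# solving functions; they are not defined in this snippet.
@op(out={"a": Out(), "b": Out(), "c": Out(), "d": Out()})
def derive_abcd(a, b, c, d):
    # Compare against None explicitly so that a value of 0 still counts as known.
    if c is None and d is not None:
        c = D_to_C(d)
    if d is None and c is not None:
        d = C_to_D(c)
    if a is None and b is not None and c is not None:
        a = BC_to_A(b, c)
    if b is None and a is not None and c is not None:
        b = AC_to_B(a, c)
    if c is None and a is not None and b is not None:
        c = AB_to_C(a, b)

    # Pass every variable on (solved or still None) to the next copy in the chain.
    yield Output(a, "a")
    yield Output(b, "b")
    yield Output(c, "c")
    yield Output(d, "d")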

I think the right approach would just depend on what benefits you hope to get out of using multiple ops over a single function. Happy to discuss more though :)
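To get a feel for how many chained copies such a graph would need, here is a plain-Python sketch (no Dagster involved) that runs the same sequence of checks repeatedly and counts the passes until nothing changes. The helper names `derive_abcd_once` and `passes_needed` are made up for this illustration, and the formulas are filled in from the equations A / B = C and C = exp(D).

```python
from math import exp, log
from typing import Optional, Tuple

Values = Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]


def derive_abcd_once(a, b, c, d) -> Values:
    """One pass of the checks, in the same order as derive_abcd."""
    if c is None and d is not None:
        c = exp(d)   # C = exp(D)
    if d is None and c is not None:
        d = log(c)   # D = log(C)
    if a is None and b is not None and c is not None:
        a = c * b    # A = C * B
    if b is None and a is not None and c is not None:
        b = a / c    # B = A / C
    if c is None and a is not None and b is not None:
        c = a / b    # C = A / B
    return a, b, c, d


def passes_needed(values: Values, max_passes: int = 10) -> int:
    """Count how many chained passes change something before a fixed point."""
    for n in range(max_passes):
        new_values = derive_abcd_once(*values)
        if new_values == values:
            return n
        values = new_values
    raise RuntimeError("no fixed point reached within max_passes")


print(passes_needed((6.0, 3.0, None, None)))   # A and B known
print(passes_needed((None, 2.0, None, 0.0)))   # B and D known
```

With A and B known, C is only found by the last check of the first pass, so D needs a second pass; with B and D known, one pass is enough. The number of copies to chain therefore depends on both the equations and the order of the checks.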

3 replies

ifethereal Dec 21, 2021
Author

Thanks for the response!

> However, in practice I think it would be quite difficult to write such a DAG in a way that guaranteed that you wouldn't redo work you had already done. The graph would need multiple copies of each op, and get pretty complex if you had an appreciable number of variables you wanted to handle.

You're quite right here. To that end, one starting point that came to mind (in terms of how to generate this DAG) was this supersequence problem in combinatorics. I believe an algorithm that generates such a supersequence could be used to identify how to programmatically construct this DAG so that all the sub-DAGs were contained within. Your code snippet cloned and chained together is also roughly meant to solve this problem—what I'm afraid of is that the set of equations may not always be so benign as to lend itself easily to this "stencil" approach (where designing the stencil may require some thought and knowing the number of copies to use may be subtle).
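One way to read the supersequence idea: every combination of known variables implies an order in which solution ops must run, and a single linear layout of op slots can host all of them if each order appears in the layout as a subsequence. The sketch below only checks a candidate layout; it does not generate one. The solve paths listed are my own assumption of what the paths look like for A / B = C and C = exp(D), and `is_subsequence` and `uncovered_paths` are hypothetical helper names.

```python
from typing import Iterable, List, Sequence


def is_subsequence(path: Sequence[str], layout: Sequence[str]) -> bool:
    """True if `path` appears in `layout` in order (not necessarily adjacent)."""
    remaining = iter(layout)
    # `step in remaining` consumes the iterator up to and including the first match.
    return all(step in remaining for step in path)


def uncovered_paths(paths: Iterable[Sequence[str]], layout: Sequence[str]) -> List[Sequence[str]]:
    """Return the solve paths that the layout cannot host."""
    return [p for p in paths if not is_subsequence(p, layout)]


# Assumed solve paths, one per pair of known variables.
SOLVE_PATHS = [
    ["AB_to_C", "C_to_D"],   # A, B known
    ["D_to_C", "BC_to_A"],   # B, D known
    ["D_to_C", "AC_to_B"],   # A, D known
    ["BC_to_A", "C_to_D"],   # B, C known
    ["AC_to_B", "C_to_D"],   # A, C known
]

# A candidate layout of op slots; every path above is a subsequence of it.
LAYOUT = ["D_to_C", "AB_to_C", "BC_to_A", "AC_to_B", "C_to_D"]

print(uncovered_paths(SOLVE_PATHS, LAYOUT))
```

For this small set of equations a layout with one copy of each op happens to suffice; less benign equation sets may need repeated ops, which is where the length of the supersequence starts to matter.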

I produced what I thought was an honest attempt here (Gist link) using the supersequence approach. I decided to introduce ops that would "collect" the solved results up until that point in the DAG (the standby_x() ops). The idea behind this is that it (hopefully) shouldn't be necessary for the DAG to know which op the solution for a variable came from, as long as there is a solution. Running it, however, quickly shows that my "collection" ops don't actually work as intended.

This leads me to my next Dagster question:

Is there some way to declare "optional" dependencies (coming from different ops) such that downstream ops can continue to execute even if some/all of the optional dependencies were skipped?

Some kind of "dynamic" fan-in but compatible with the way I'm generating the DAG (via GraphDefinition/DependencyDefinition—clunky, but it seemed the obvious way for me to use the supersequence easily).

My intent was to propagate whatever solved/known values (if any) through the rest of the DAG, but the way I've designed it unfortunately seems to require that each possible method of solving for a particular variable actually produces the solution. For example, I have provided known values for A and B, so standby_A/1 shouldn't be skipped just because one of the ops that solves for A (BC_to_A/0) produces no answer.

2021-12-22 02:00:51 +1100 - dagster - INFO - resolve - 19650fcc-3417-44d5-9b79-aad83d4fe6c1 - standby_C/0 - Solid standby_C/0 did not fire outputs {'result'}
2021-12-22 02:00:51 +1100 - dagster - INFO - resolve - 19650fcc-3417-44d5-9b79-aad83d4fe6c1 - BC_to_A/0 - Skipping step BC_to_A/0 due to skipped dependencies: [].
2021-12-22 02:00:51 +1100 - dagster - INFO - resolve - 19650fcc-3417-44d5-9b79-aad83d4fe6c1 - standby_A/1 - Skipping step standby_A/1 due to skipped dependencies: ['BC_to_A/0'].

OwenKephart Dec 21, 2021
Maintainer

This is a fascinating problem, thanks for posing!

I created my own version of this, which is not fully correct, but hopefully makes some progress towards your goal:

https://gist.github.com/OwenKephart/b6685e4e52469d708d793b8df1ed500d

The idea is that you can indeed do what you're describing (just require that at least one op has produced the output) by using lists of optional outputs. In Dagster, if at least one element of a list of optional outputs has been produced, then the downstream input is satisfied, otherwise the op will be skipped.

Inside the ops, you can just take the first element of the list and use that for your computation (even if there are multiple, they should be guaranteed to be identical, so it doesn't matter which one you take).
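The difference between ordinary dependencies and a fan-in list can be illustrated with a toy model of the skipping rule described above. This is not Dagster code and calls no Dagster API; it only mimics the rule "an input is satisfied if at least one upstream output in its list fired". The step names and the `known_X` sources are hypothetical, and the model assumes that any step that runs also fires its output.

```python
from typing import Dict, List, Set

# Each step maps to its inputs; each input is a list of upstream names.
# A one-element list models an ordinary dependency, a longer list models
# a fan-in over several optional outputs.
Steps = Dict[str, List[List[str]]]


def steps_that_run(steps: Steps, fired_sources: Set[str]) -> List[str]:
    """Return the steps that execute, assuming `steps` is in topological order."""
    fired = set(fired_sources)
    ran = []
    for name, inputs in steps.items():
        # Every input needs at least one fired upstream output.
        if all(any(up in fired for up in group) for group in inputs):
            fired.add(name)  # simplification: a step that runs fires its output
            ran.append(name)
    return ran


known = {"known_A", "known_B"}  # C is not provided

# Separate inputs: standby_A needs both known_A and BC_to_A, so it is skipped.
separate = {
    "BC_to_A": [["known_B"], ["known_C"]],
    "standby_A": [["known_A"], ["BC_to_A"]],
}

# Fan-in: standby_A takes one list input and runs if any element fired.
fan_in = {
    "BC_to_A": [["known_B"], ["known_C"]],
    "standby_A": [["known_A", "BC_to_A"]],
}

print(steps_that_run(separate, known))
print(steps_that_run(fan_in, known))
```

In the first layout nothing runs, matching the skipped `standby_A/1` step in the log above; in the second, `standby_A` runs because `known_A` alone satisfies its list input.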

The pattern of using loops inside a job definition (especially when some of the outputs are optional) is pretty rare and can be hard to parse, but if you open it up in Dagit it should be easier to see what's going on. This is probably a fairly inefficient way of doing things on the whole (and it will probably create quite complex graphs), but perhaps this is sufficient inspiration :)

ifethereal Dec 24, 2021
Author

Thanks so much for the close engagement! Proud to provide here Dagster code that now fulfils my initially stated use case for the set of equations we've been considering so far:

https://gist.github.com/ifethereal/eb7b10bcd8732353463a7c06fb2bb910

This was the crucial piece for me:

> In Dagster, if at least one element of a list of optional outputs has been produced, then the downstream input is satisfied, otherwise the op will be skipped.

so thanks for pointing it out! Your use of lists also inspired some improvements for my "collection" ops.

For the future reader, the code still has some rough edges, but I think my main structural Dagster questions have all been answered. These are further problems that I would consider solving before calling this anything more than a proof-of-concept:

Vectorisation: This is potentially not so difficult by making the code aware of one (additional) level of nesting, although this would take some experimenting I imagine.
Reducing redundant computations: Some of the computations may be pointlessly repeated to re-solve for a variable whose value was already known.
Support for solution ops solving for multiple variables at once: For this particular set of equations, it happens that each solution op is of the form of taking one of the equations where all but one of the variables is known and solving for the (single) unknown variable involved in the equation. There certainly exist scenarios where solution ops might not take this form. For example, consider the equations A + 2B = C and A + 3B = D where C and D are known.
Improving developer experience: It should be easy for a separate developer to take this architecture and apply it to a different set of equations easily by providing all the necessary solution ops. In particular the latecomer should ideally not need to know very much about the details of the architecture (for example the supersequence approach used here). I imagine this would involve designing more op factories or graph factories.
Non-exactness of solution ops: Depending (very) closely on the particular set of equations, if there are multiple pathways of obtaining values for all variables, it's conceivable that some pathways may suffer from worse numerical accuracy than others. This includes problems of floating-point precision but very likely far more than that.
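To make the multi-variable item concrete: for A + 2B = C and A + 3B = D with C and D known, subtracting the first equation from the second gives B = D - C, and substituting back gives A = C - 2B. Neither equation can be solved on its own, so a single solution step has to produce both A and B. A minimal sketch, with `CD_to_AB` as a hypothetical name following the naming pattern used in this thread:

```python
from typing import Tuple


def CD_to_AB(c: float, d: float) -> Tuple[float, float]:
    """Solve A + 2B = C and A + 3B = D for (A, B) given C and D."""
    b = d - c        # (A + 3B) - (A + 2B) = B
    a = c - 2 * b    # back-substitute into A + 2B = C
    return a, b


# Check against A = 1, B = 2, which gives C = 5 and D = 7.
print(CD_to_AB(5.0, 7.0))
```

In the architecture discussed here, such a step would need two outputs rather than one, so the "one op solves one variable" assumption no longer holds.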
