workflow
Workflow trichotomy:

- 'makefile'-style -- reverse dependency graph
  - know your endpoint but not your beginning
  - strong idempotency
  - doesn't like to be dependent on data
  - describe dependencies of products
  - triggering a product triggers backwards on the graph until products are grounded
- 'script'-style -- forward-running workflow graph
  - set of steps
  - triggered in order
  - know your beginning but not your endpoint (may have many choices)
  - know your direction/velocity
- resource invocation -- imperative actions
  - assemble resources
  - trigger actions on those resources
  - for example, a 'git repo' resource: you can ensure its existence (by cloning), set it to a specific branch or commit, delete it, pull, fetch, push, or merge (see the sketch below)
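
As a rough illustration of the resource-invocation style, here is a minimal sketch of such a 'git repo' resource. The `GitRepo` class and its method names are hypothetical, invented for this example; only the actions themselves come from the list above.

```ruby
require 'fileutils'

# Hypothetical sketch of a 'git repo' resource: a bundle of state
# (url, local path) plus imperative actions you trigger on it.
class GitRepo
  def initialize(url, path)
    @url, @path = url, path
  end

  # Idempotently ensure the repo exists locally (clone if absent).
  def ensure_exists
    system('git', 'clone', @url, @path) unless File.directory?(File.join(@path, '.git'))
  end

  def checkout(ref)
    system('git', '-C', @path, 'checkout', ref)
  end

  def pull
    system('git', '-C', @path, 'pull')
  end

  def delete
    FileUtils.rm_rf(@path)
  end
end

repo = GitRepo.new('https://github.com/infochimps-labs/wukong.git', '/tmp/wukong')
repo.ensure_exists
repo.checkout('master')
```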
For any defined abstraction layer:
- Only important that the contract is adhered to
- No implication that there are lower level abstraction layers
- May show a forward-looking vision of elegant lower level abstractions
(Having a graph will help you express parallel execution)
- can refer to a job by its intrinsic info
- can refer to a job not yet defined
```ruby
chain :twitter_parse do
  wukong_rb 'parse_api.rb'
  pig       'uniq_and_unsplice.pig'
end
```
```ruby
Wukong.workflow(:launch) do
  task :aim do
    # ...
  end
  task :enter do
  end
  task :commit do
    # ...
  end
end

Wukong.workflow(:recall) do
  task :smash_with_rock do
    # ...
  end
  task :reprogram do
    # ...
  end
end
```
Wukong workflows work somewhat differently from what you may be familiar with in Rake and similar tools. In Wukong, a stage corresponds to a product; you can then act on that product.
Consider first compiling a C program:
- to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
- to build files like `{file}.o`, run `cc -c -o {file}.o {file}.c -I./include`

In this case, you define the steps, implying the products (a toy sketch of running such a graph follows).
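
To make the reverse-dependency idea concrete, here is a toy sketch of grounding those products; this is not Wukong's actual API, just a few lines of plain Ruby built around the two `cc` rules above.

```ruby
# Toy sketch (not Wukong's actual API) of a makefile-style reverse
# dependency graph: each product names its prerequisites and the
# command that builds it; requesting a product recurses backwards
# until everything is grounded in files that already exist.
RULES = {
  'cake' => {
    deps:  %w[eggs.o milk.o flour.o sugar.o],
    build: ->(target, deps) { system("cc -o #{target} #{deps.join(' ')} -I./include -L./lib") }
  }
}

# Pattern rule: any '{file}.o' is built from the matching '{file}.c'.
def rule_for(target)
  RULES[target] ||
    (target.end_with?('.o') &&
      { deps:  [target.sub(/\.o$/, '.c')],
        build: ->(t, deps) { system("cc -c -o #{t} #{deps.first} -I./include") } })
end

def build(target)
  rule = rule_for(target)
  return if !rule || File.exist?(target)  # grounded: a source file, or already built (no staleness check in this toy)
  rule[:deps].each { |dep| build(dep) }   # walk backwards through prerequisites first
  rule[:build].call(target, rule[:deps])
end

build('cake')
```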
Something Rake can't do (but we should be able to): let me define a dependency that runs last.
A run is the event that ensues when you invoke a workflow. Invoking the bake_pie workflow at 01:20:55 on Jan 30, 2012 results in the bake_pie-20120130012055 run.
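
The exact run-id format isn't specified beyond that example; a minimal sketch of constructing one, inferring the timestamp layout from `bake_pie-20120130012055`:

```ruby
require 'time'

# Build a run id like 'bake_pie-20120130012055' from the workflow
# name and the invocation time (format inferred from the example above).
def run_id(workflow_name, at = Time.now)
  "#{workflow_name}-#{at.strftime('%Y%m%d%H%M%S')}"
end

run_id(:bake_pie, Time.parse('2012-01-30 01:20:55'))  # => "bake_pie-20120130012055"
```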
A stage is a data process having
- one input, an array of length one called `inputs` (later: multiple inputs, named inputs)
- one output, called `output` (later: multiple outputs, named outputs)
- (later) an error channel named `:error`
Any stage can be invoked by name; only that stage is executed.
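
A minimal sketch of that contract; the `Stage` class here is illustrative only, not Wukong's real implementation:

```ruby
# Illustrative stage: one `inputs` array of length one, one `output`,
# and (later) an error channel.
class Stage
  attr_reader :name, :inputs, :output

  def initialize(name, &block)
    @name, @block = name, block
  end

  # The default action: consume inputs, produce the output.
  def call(inputs)
    @inputs = Array(inputs)[0, 1]     # one input for now (later: multiple, named)
    @output = @block.call(@inputs.first)
  end
end

upcase = Stage.new(:upcase) { |str| str.to_s.upcase }
upcase.call(['hello'])  # => "HELLO"
```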
A chain runs a sequence of stages, one after the other, in order. A chain is itself a stage; it has an array of sub-stages (called steps) that it will execute in order.
- the input to the chain becomes the input to the first stage, and the output of the last stage becomes the output of the chain.
You can of course invoke any stage within a chain directly.
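
Continuing the illustrative `Stage` class above, a sketch of that wiring:

```ruby
# A chain is itself a stage whose steps run in order, threading each
# step's output into the next step's input.
class Chain < Stage
  def initialize(name, steps)
    super(name)
    @steps = steps
  end

  def call(inputs)
    @inputs = Array(inputs)[0, 1]
    # The chain's input feeds the first step; the last step's output
    # becomes the chain's output.
    @output = @steps.reduce(@inputs.first) { |data, step| step.call([data]) }
  end
end

strip  = Stage.new(:strip)  { |s| s.strip }
upcase = Stage.new(:upcase) { |s| s.upcase }
Chain.new(:tidy, [strip, upcase]).call(['  hello  '])  # => "HELLO"
```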
A shell_process invokes the swineherd runner. It has:
- a hash of config variables
- (?ordered?) inputs
- one output, named `:output`, and an error channel named `:error`
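
The swineherd runner's real interface isn't shown in these notes; purely as a sketch of the contract just listed (config hash, ordered inputs, an `:output` and an `:error` channel), a shell-out stage might look like:

```ruby
require 'open3'

# Rough, hypothetical sketch of a shell-out stage: config vars become
# environment variables, inputs become arguments, stdout is :output
# and stderr is the :error channel.
class ShellProcess
  def initialize(cmd, config = {})
    @cmd, @config = cmd, config
  end

  def call(inputs)
    env = @config.transform_keys(&:to_s).transform_values(&:to_s)
    out, err, status = Open3.capture3(env, @cmd, *inputs.map(&:to_s))
    { output: out, error: err, exit_status: status.exitstatus }
  end
end

wc = ShellProcess.new('wc', 'LC_ALL' => 'C')
wc.call(['-l', '/etc/hosts'])  # => { output: "...", error: "", exit_status: 0 }
```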
By default, a stage's inputs are specified by the outputs of its dependencies.
The output asset names are constructed from the stage's metadata. There is a small set of pathname templates (in fact, only one):
- Development mode output pathname template. Built somehow from: %{user}, %{run_id}, %{session}, %{run_index}, %{prod|dev|test}
  (?implement a template that you think works; those are some possible ingredients we'll codify and/or fix?)
- (later) Automated mode output pathname template (used when deployment class is prod or test): /%{project_path}/%{run_id}/%{transformed_stage_name}-%{deployment_class}
  (just implement something sensible; we'll figure out the details)
  Built somehow from: %{user}, %{session}, %{run_index}, %{prod|dev|test}, %{timestamp}

Template ingredients (a sketch of filling such a template follows this list):
- project_path: a container for runs for the same purpose/project
- session: a temporally close, connected set of runs
- run_index: an auto-incremented counter for the runs
- deployment_class: the type of deployment instantiation. These may be used for more than one granularity of sets of runs.
- run_id: the time the run started plus some other information to uniquely identify this specific invocation of the workflow. (?complete as you find natural?)
- timestamp: timestamp of the run. Everything in this invocation will have the same timestamp.
- user: username; `ENV['USER']` by default
- sources: basenames of job inputs, minus extension, non-`\w` characters replaced with '_', joined by '-', max 50 chars.
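
Ruby's built-in `String#%` already supports `%{...}` named references, so filling such a template is a one-liner. The template string below is the automated-mode one quoted above; the metadata values are made up:

```ruby
# Fill a pathname template with stage/run metadata using Ruby's
# %{...} named interpolation.
TEMPLATE = '/%{project_path}/%{run_id}/%{transformed_stage_name}-%{deployment_class}'

metadata = {
  project_path:           'bake_pie',
  run_id:                 'bake_pie-20120130012055',
  transformed_stage_name: 'mix_batter',
  deployment_class:       'dev'
}

TEMPLATE % metadata
# => "/bake_pie/bake_pie-20120130012055/mix_batter-dev"
```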
Normally, one should not rename inputs and outputs. However, there are some (hopefully rare) cases where they may be renamed. Example cases include:

You can override the default input name to adapt to external processes:
- (show how; a purely hypothetical sketch follows below)
- (make sure I can still inject an explicit name at execution time)

You can also inject an explicit name:
- (show how)
...
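
The real syntax is left open above ('show how'), so the following is purely hypothetical: one possible shape for an `:input` override, with a stand-in `task` method so the snippet runs:

```ruby
# Purely hypothetical sketch -- the notes above leave the real syntax
# as '(show how)'. Suppose the DSL accepted an :input option that
# overrides the default (dependency-derived) input name:
def task(name, options = {}, &block)   # stand-in for the real DSL method
  puts "defined #{name} with input #{options.fetch(:input, '(default)')}"
end

# Override the default input name to adapt to an external process:
task :ingest, input: '/incoming/thirdparty/latest.tsv' do
  # ...
end

# Injecting an explicit name at execution time might instead come
# from the command line (also hypothetical):
#   wukong run bake_pie --output=/data/pies/special_run.tsv
```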
- handled by configliere: `nukes launch --launch_code=GLG20`
- TODO: configliere needs context-specific config vars, so I only get information about the launch action in the nukes job when I run `nukes launch --help`
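
A minimal sketch of the configliere side of that example; `Settings.define` and `Settings.resolve!` are real configliere calls, while the description text is made up for illustration:

```ruby
require 'configliere'

# Declare the flag from the example above, then let configliere
# collect it from the command line.
Settings.use :commandline
Settings.define :launch_code, description: 'Launch code for the nukes job'
Settings.resolve!   # parses ARGV, so `--launch_code=GLG20` lands here

puts Settings[:launch_code]  # => "GLG20" when run with --launch_code=GLG20
```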
- when files are generated or removed, relocate them to a timestamped location (see the sketch below)
- a file /path/to/file.txt is relocated to ~/.wukong/backups/path/to/file.txt.wukong-20110102120011, where 20110102120011 is the job timestamp
- accepts a max_size param
- raises if it can't write to the directory -- you must explicitly say --safe_file_ops=false
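
A sketch of that relocation rule; the helper name `relocate_for_backup` is made up for illustration:

```ruby
require 'fileutils'

# /path/to/file.txt goes to
# ~/.wukong/backups/path/to/file.txt.wukong-<job timestamp>.
def relocate_for_backup(path, job_timestamp)
  backup = File.join(Dir.home, '.wukong', 'backups',
                     "#{path.sub(%r{\A/}, '')}.wukong-#{job_timestamp}")
  FileUtils.mkdir_p(File.dirname(backup))  # failing here raises, per the notes above
  FileUtils.mv(path, backup)
  backup
end

# relocate_for_backup('/path/to/file.txt', '20110102120011')
# # => "#{Dir.home}/.wukong/backups/path/to/file.txt.wukong-20110102120011"
```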
each action:
- the default action is `call`
- all stages respond to `nothing`, and like ze goggles, do `nothing`.
- `clobber` -- run, but clear all dependencies
- `undo` --
- `clean` --
- `create` --
- `update` -- applies the given *note the difference
- `delete`
- `invoke` --
- `run` --
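
A sketch of the dispatch those notes imply: the default action is `call`, and every stage responds to `nothing` by doing nothing. The `ActingStage` class is invented for illustration:

```ruby
class ActingStage
  def call
    puts 'running the stage'
  end

  def nothing
    # like ze goggles: does nothing
  end

  def clobber
    puts 'clearing dependencies'   # run, but clear all dependencies
    call
  end

  # Dispatch an action by name, defaulting to :call.
  def act(action = :call)
    public_send(action)
  end
end

stage = ActingStage.new
stage.act            # default action: call
stage.act(:nothing)  # does nothing
stage.act(:clobber)  # clears dependencies, then runs
```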
The primitives correspond heavily with Rake and Chef. However, they extend them in many ways, don't cover all of their functionality, and are incompatible in several ways.
- `directory`, `symlink`
- `template` -- fill in a file with variables supplied at runtime (sketched below)
- `remote_file` --
- `git_repo`
- `script` -- with specializations like `hadoop_job`, `r_script`
- `remote_request` -- call to an external product over the network
- `http_request` -- a type of remote request
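
The `template` primitive's contract above (fill in a file with variables supplied at runtime) maps naturally onto plain ERB; a sketch, with the file name and variables made up:

```ruby
require 'erb'

# Fill in a template with variables supplied at runtime.
template_text = "listen_port: <%= port %>\nlog_dir: <%= log_dir %>\n"

vars = { port: 8080, log_dir: '/var/log/bake_pie' }
rendered = ERB.new(template_text).result_with_hash(vars)

File.write('/tmp/bake_pie.yml', rendered)
puts rendered
# listen_port: 8080
# log_dir: /var/log/bake_pie
```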