Defining a seesaw warrior project
Here's how to use the seesaw kit for your own project and how to prepare your project for the ArchiveTeam Warrior.
We assume that your project consists of a number of small tasks that you want completed (e.g., downloading users) and that you have a tracker, that is, a way to hand out tasks to individual users and record their progress.
You should define a pipeline, the sequence of steps that complete a task. Your pipeline will probably be something like this:
- Request an assignment from the tracker.
- Run Wget to download the file.
- Upload the downloaded file with rsync.
- Tell the tracker that the assignment is done.
You specify the pipeline as a Python file. When you run your project the seesaw kit interprets the file and executes the steps.
Your pipeline contains a sequence of tasks that should be executed. When you run the seesaw script (or the warrior), the script will feed items to your pipeline. Each item represents an individual work unit (e.g., a single user). Each task will work on an item, then pass it on to the next task in the pipeline. Items have properties that can be read and updated by the tasks. For example, the `GetItemFromTracker` task could set the `item_name` property that is then used by the `WgetDownload` task to download a file.
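The item flow described above can be sketched in plain Python. This is a toy model for illustration only, not the seesaw kit's implementation: the functions `get_item_from_tracker` and `wget_download`, the `run_pipeline` helper, and the URL are all hypothetical stand-ins.

```python
# Toy model of an item passing through a pipeline (not the seesaw API).

def get_item_from_tracker(item):
    # A real task would ask the tracker; here the name is hard-coded.
    item["item_name"] = "user-123"
    return item

def wget_download(item):
    # Reads the property set by the previous task.
    item["url"] = "http://example.com/users/%s" % item["item_name"]
    return item

def run_pipeline(tasks, item):
    # Each task works on the item, then hands it to the next task.
    for task in tasks:
        item = task(item)
    return item

item = run_pipeline([get_item_from_tracker, wget_download], {})
print(item["url"])
```

The point of the sketch is the hand-off: the second step only works because the first step stored `item_name` on the shared item.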
Your pipeline definition is a normal Python file that initializes the `project` and `pipeline` variables. (You may need to import the modules you use in your pipeline definition.)
`project` is an instance of `seesaw.project.Project` that defines the title of the project, a short description with an optional project logo, and an optional deadline. This information is shown in the web interface while the project is running.
Example:
    import datetime

    from seesaw.project import Project

    project = Project(
        title = "Example project",
        project_html = """
        <img class="project-logo" alt="Project logo" src="http://archive.org/images/glogo.png" height="50px" />
        <h2>Example project <span class="links"><a href="http://example.com/">Example website</a> · <a href="http://example.heroku.com/">Leaderboard</a></span></h2>
        <p>This is an example project. Under a logo and title there's some room for extra information.</p>
        """,
        utc_deadline = datetime.datetime(2013,1,1, 12,0,0)
    )
`pipeline` is an instance of `seesaw.pipeline.Pipeline` that defines the actual sequence of tasks that should be executed for each item. It is possible to implement your own tasks (one of the simplest examples is `CheckIP`), but for most common tasks you can use one of the predefined implementations.
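As a rough sketch of what a custom task might look like, the example below assumes the kit's `seesaw.task.SimpleTask` base class, whose subclasses implement a `process(item)` method; check this against the version of the kit you have installed. The `SetWarcFileBase` task and its `prefix` argument are hypothetical, and the fallback class exists only so the sketch runs without the kit.

```python
try:
    from seesaw.task import SimpleTask
except ImportError:
    # Stand-in so the sketch runs without the seesaw kit installed.
    class SimpleTask(object):
        def __init__(self, name):
            self.name = name


class SetWarcFileBase(SimpleTask):
    """Hypothetical task: derives a file name from the item name."""

    def __init__(self, prefix):
        SimpleTask.__init__(self, "SetWarcFileBase")
        self.prefix = prefix

    def process(self, item):
        # Item properties are read and updated like dictionary entries.
        item["warc_file_base"] = "%s-%s" % (self.prefix, item["item_name"])


# Calling process() directly with a plain dict, just to show the effect.
item = {"item_name": "user-123"}
SetWarcFileBase("example").process(item)
print(item["warc_file_base"])
```

In a real pipeline the task would be listed inside `Pipeline(...)` rather than called by hand, so that later tasks can use the property it sets.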
Example:
    # downloader, USER_AGENT and VERSION are assumed to be defined earlier in the file.
    pipeline = Pipeline(
        GetItemFromTracker("http://tracker.archiveteam.org/example", downloader),
        WgetDownload([ "./wget-warc-lua",
                "-U", USER_AGENT,
                "-nv",
                "-o", ItemInterpolation("%(item_dir)s/wget.log"),
                "--lua-script", "picplz-user.lua",
                "--directory-prefix", ItemInterpolation("%(item_dir)s/files"),
                "--force-directories",
                "-e", "robots=off",
                "--page-requisites", "--span-hosts",
                "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
                "--warc-header", "operator: Archive Team",
                "--warc-header", "picplz-dld-script-version: " + VERSION,
                "--warc-header", ItemInterpolation("picplz-user-id: %(item_name)s"),
                ItemInterpolation("http://api.picplz.com/api/v2/user.json?id=%(item_name)s&include_detail=1&include_pics=1&pic_page_size=100")
            ],
            max_tries = 2,
            accept_on_exit_code = [ 0, 8 ],
            env = { "picplz_lua_json": ItemInterpolation("%(item_dir)s/%(warc_file_base)s.json") }
        ),
        RsyncUpload(
            target = "fos.textfiles.com::picplz/%s/" % downloader,
            target_source_path = "data/",
            files = [
                ItemInterpolation("%(prefix_dir)s/%(warc_file_base)s.warc.gz"),
                ItemInterpolation("%(prefix_dir)s/%(warc_file_base)s.json")
            ],
            bwlimit=ConfigValue(name="Rsync bwlimit", default="0")
        ),
        SendDoneToTracker(
            tracker_url = "http://tracker.archiveteam.org/example",
            stats = ItemValue("stats")
        )
    )
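The `ItemInterpolation` templates in the example use Python's `%(name)s` placeholders. As a rough illustration of what such a template expands to, the sketch below assumes the placeholders are filled from the item's properties with ordinary `%` string formatting. The property values are made up.

```python
# Made-up item properties; in a real run the tasks set these.
item = {
    "item_dir": "data/user-123",
    "warc_file_base": "example-user-123",
    "item_name": "user-123",
}

# The same template strings that appear in the pipeline example.
warc_path = "%(item_dir)s/%(warc_file_base)s" % item
header = "picplz-user-id: %(item_name)s" % item

print(warc_path)  # data/user-123/example-user-123
print(header)     # picplz-user-id: user-123
```

Because the expansion happens per item, the same pipeline definition can be reused for every work unit the tracker hands out.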
To run your pipeline, install the seesaw kit and run the pipeline file with the `run-pipeline` command.
The ArchiveTeam Warrior can run seesaw kit pipelines. The warrior needs a few extras besides the pipeline file:
- The project should be in a public Git repository (e.g., on GitHub). The warrior will clone this repository before running the project.
- The pipeline definition should be in a file called `pipeline.py` in the project's root directory.
- If your project has any extra requirements, you should create an executable `warrior-install.sh` Bash script that installs your dependencies. You may use `sudo` to perform the installations.
- Your project pipeline should download files to the `data` subdirectory (which is mapped to a special drive in the warrior VM).
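Some of the requirements above can be checked locally before publishing the repository. The helper below is a hypothetical sketch, not part of the seesaw kit or the warrior: `check_warrior_layout` only looks for `pipeline.py` in the project root and, if a `warrior-install.sh` exists, checks that it is executable.

```python
import os
import tempfile


def check_warrior_layout(project_dir):
    """Return a list of problems found in project_dir (empty if none)."""
    problems = []
    if not os.path.isfile(os.path.join(project_dir, "pipeline.py")):
        problems.append("pipeline.py is missing from the project root")
    install = os.path.join(project_dir, "warrior-install.sh")
    # The install script is optional, but must be executable if present.
    if os.path.isfile(install) and not os.access(install, os.X_OK):
        problems.append("warrior-install.sh is not executable")
    return problems


# Usage: build a throwaway project directory and check it.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "pipeline.py"), "w").close()
print(check_warrior_layout(demo_dir))  # prints [] for this demo directory
```

The check does not cover the public Git repository or the `data` subdirectory, since those depend on how the project is hosted and run.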
See the example-seesaw-project for a (very simple) example.