Defining a seesaw warrior project

alard edited this page Aug 3, 2012 · 4 revisions

Here's how to use the seesaw kit for your own project and how to prepare your project for the ArchiveTeam Warrior.

Overview

Concept

We assume that your project consists of many small work units (e.g., one user to download per unit) and that you have a tracker: a way to hand out work units to individual users and record their progress.

You should define a pipeline: the sequence of steps that completes one work unit. Your pipeline will probably look something like this:

  1. Request an assignment from the tracker.
  2. Run Wget to download the file.
  3. Upload the downloaded file with rsync.
  4. Tell the tracker that the assignment is done.

You specify the pipeline as a Python file. When you run your project, the seesaw kit interprets the file and executes the steps for each work unit.

Terminology

Your pipeline contains a sequence of tasks that should be executed. When you run the seesaw script (or the warrior), the script will feed items to your pipeline. Each item represents an individual work unit (e.g., a single user). Each task will work on an item, then pass it on to the next task in the pipeline. Items have properties that can be read and updated by the tasks. For example, the GetItemFromTracker task could set the item_name property that is then used by the WgetDownload task to download a file.
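
The relationship between items, tasks, and the pipeline can be pictured with a small standalone sketch. It does not use the seesaw kit itself: the run_pipeline helper, the two task functions, and the downloaded_path property are hypothetical stand-ins chosen for illustration. Only the item_name property comes from the description above.

```python
# Toy model of the terminology above; this is NOT the seesaw API.
# An item is a dict of properties; a task is a callable that reads and updates it.

def get_item_from_tracker(item):
    # A real tracker would hand out the name; here it is hard-coded (assumption).
    item["item_name"] = "user-123"

def download(item):
    # Reads the property set by the previous task and records a new one.
    item["downloaded_path"] = "data/%s.warc.gz" % item["item_name"]

def run_pipeline(tasks, item):
    # Pass the same item through every task, in order.
    for task in tasks:
        task(item)
    return item

item = run_pipeline([get_item_from_tracker, download], {})
print(item["downloaded_path"])
```

The point of the sketch is only the data flow: each task sees the properties left behind by the tasks before it.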

Pipeline definition

Your pipeline definition is a normal Python file that initializes project and pipeline variables. (You may need to import the modules you use in your pipeline definition.)

project is an instance of seesaw.project.Project that defines the title of the project, a short description with an optional project logo and an optional deadline. The information will be shown in the web interface when the project is running.

Example:

import datetime
from seesaw.project import Project

project = Project(
  title = "Example project",
  project_html = """
    <img class="project-logo" alt="Project logo" src="http://archive.org/images/glogo.png" height="50px" />
    <h2>Example project <span class="links"><a href="http://example.com/">Example website</a> &middot; <a href="http://example.heroku.com/">Leaderboard</a></span></h2>
    <p>This is an example project. Under a logo and title there's some room for extra information.</p>
  """,
  utc_deadline = datetime.datetime(2013,1,1, 12,0,0)
)

pipeline is an instance of seesaw.pipeline.Pipeline that defines the actual sequence of tasks that should be executed for each item. It is possible to implement your own tasks (more on that later, TODO), but for most common tasks you can use one of the predefined implementations.

Example:

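# This example assumes that USER_AGENT, VERSION and downloader are already
# defined, and that the task classes used below have been imported.
pipeline = Pipeline(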
  GetItemFromTracker("http://tracker.archiveteam.org/example", downloader),
  WgetDownload([ "./wget-warc-lua",
      "-U", USER_AGENT,
      "-nv",
      "-o", ItemInterpolation("%(item_dir)s/wget.log"),
      "--lua-script", "picplz-user.lua",
      "--directory-prefix", ItemInterpolation("%(item_dir)s/files"),
      "--force-directories",
      "-e", "robots=off",
      "--page-requisites", "--span-hosts",
      "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
      "--warc-header", "operator: Archive Team",
      "--warc-header", "picplz-dld-script-version: " + VERSION,
      "--warc-header", ItemInterpolation("picplz-user-id: %(item_name)s"),
      ItemInterpolation("http://api.picplz.com/api/v2/user.json?id=%(item_name)s&include_detail=1&include_pics=1&pic_page_size=100")
    ],
    max_tries = 2,
    accept_on_exit_code = [ 0, 8 ],
    env = { "picplz_lua_json": ItemInterpolation("%(item_dir)s/%(warc_file_base)s.json") }
  ),
  RsyncUpload(
    target = "fos.textfiles.com::picplz/%s/" % downloader,
    target_source_path = "data/",
    files = [
      ItemInterpolation("%(prefix_dir)s/%(warc_file_base)s.warc.gz"),
      ItemInterpolation("%(prefix_dir)s/%(warc_file_base)s.json")
    ],
    bwlimit=ConfigValue(name="Rsync bwlimit", default="0")
  ),
  SendDoneToTracker(
    tracker_url = "http://tracker.archiveteam.org/example",
    stats = ItemValue("stats")
  )
)
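
The %(item_dir)s placeholders in the example use the syntax of Python's % string formatting with a mapping. As a rough illustration of what interpolating against item properties amounts to, the sketch below formats a few of the same templates against a plain dict. The property values are made up, and the interpolate helper is a hypothetical stand-in, not the seesaw implementation.

```python
# Hypothetical item properties; in a real run these are set by the tasks.
item = {
    "item_name": "12345",
    "item_dir": "data/12345",
    "warc_file_base": "picplz-12345",
}

def interpolate(template, item):
    # Standard Python %-formatting with a mapping: %(key)s looks up item["key"].
    return template % item

log_path = interpolate("%(item_dir)s/wget.log", item)
warc_path = interpolate("%(item_dir)s/%(warc_file_base)s", item)
header = interpolate("picplz-user-id: %(item_name)s", item)
```

This is presumably why the arguments are wrapped instead of being formatted directly: the pipeline is defined once, before any item exists, so the actual values can only be filled in when a task runs for a specific item.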

To run your pipeline, install the seesaw kit and pass the pipeline file to the run-pipeline command.

Warrior projects

The ArchiveTeam Warrior can run seesaw kit pipelines. The warrior needs a few extras besides the pipeline file:

  1. The project should be in a public Git repository (e.g., on GitHub). The warrior will clone this repository before running the project.
  2. The pipeline definition should be in a file called pipeline.py in the project's root directory.
  3. If your project has any extra requirements, create an executable Bash script called warrior-install.sh that installs your dependencies. You may use sudo to perform the installations.
  4. Your project pipeline should download files to the data subdirectory (which is mapped to a special drive in the warrior VM).
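
Before publishing the repository, the file layout described in the list above can be checked with a few lines of Python. This is only a sketch: the check_warrior_project function is a hypothetical helper, not part of the seesaw kit or the warrior, and it covers only the file-level requirements (items 2 and 3).

```python
import os

def check_warrior_project(root):
    """Return a list of layout problems found in the project directory `root`."""
    problems = []
    # The pipeline definition must be called pipeline.py and sit in the root.
    if not os.path.isfile(os.path.join(root, "pipeline.py")):
        problems.append("pipeline.py is missing from the project root")
    install = os.path.join(root, "warrior-install.sh")
    # The install script is optional, but if present it must be executable.
    if os.path.isfile(install) and not os.access(install, os.X_OK):
        problems.append("warrior-install.sh is not executable")
    return problems
```

Calling check_warrior_project(".") from the project root returns an empty list when both checks pass.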

See the example-seesaw-project for a (very simple) example.
