This repository has been archived by the owner on Nov 11, 2019. It is now read-only.

ScrapingSystem

Will Kahn-Greene edited this page May 8, 2015 · 2 revisions

Scraping System

Summary

steve scrapes metadata and links from lists of videos, converts that data into a form a richard instance accepts, and then pushes it to that richard instance.

Previously we used the vidscraper library. It is no longer maintained and no longer works well, so we need to implement something ourselves.

Because there are so many different kinds of video sites and the data arrives in so many different forms, it's important to have a flexible system that minimizes the time it takes someone to assemble video data.

This wiki page mulls over that problem domain.

Requirements

  1. steve fetch takes a URL and downloads the metadata for all the videos at that URL
  2. steve scrapevideo takes a URL for a single video and returns the metadata for that video
  3. easy to build a processing pipeline which, given a URL, fetches the data and pushes it through a series of transforms until it is in a form richard accepts
  4. easy to build new scrapers and have steve use them without installing software
  5. easy to create, reuse and maintain a set of default scrapers that come with steve
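
To make the third requirement more concrete, here is a minimal sketch of what a transform pipeline could look like. It is an illustration, not steve's actual design: the names `run_pipeline`, `strip_title` and `default_category`, the dict-based video records, and the field names are all assumptions made for this example, and the fetch step is replaced by a hard-coded list so that nothing touches the network.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

# A video record is modelled as a plain dict (an assumption for this sketch).
Video = Dict[str, object]
Transform = Callable[[Video], Video]


def strip_title(video: Video) -> Video:
    """Hypothetical transform: trim whitespace around the title."""
    return {**video, "title": str(video.get("title", "")).strip()}


def default_category(category: str) -> Transform:
    """Hypothetical transform factory: fill in a missing category."""
    def transform(video: Video) -> Video:
        return {**video, "category": video.get("category") or category}
    return transform


def run_pipeline(videos: Iterable[Video], transforms: List[Transform]) -> List[Video]:
    """Push every video through each transform, in order."""
    return [reduce(lambda v, t: t(v), transforms, video) for video in videos]


# Stand-in for the output of a fetch step; no network access here.
fetched = [{"title": "  Intro to scraping ", "source_url": "http://example.com/v/1"}]
result = run_pipeline(fetched, [strip_title, default_category("PyCon 2015")])
```

Each transform returns a new dict instead of mutating its input, so transforms can be reordered or reused across scrapers without side effects. Whether steve should model transforms as plain functions like this is exactly the kind of question the Architecture section still has to settle.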

Architecture

FIXME: work through this
