Create WARC files also, besides the XZ files for the urlshorteners, to make the archived url shorteners available through the wayback machine #1

Open
Arkiver2 opened this issue Aug 31, 2014 · 2 comments

@Arkiver2
Member

It would be very useful if warc.gz files were also created for the URL shorteners we are archiving.
People are probably more likely to look up a (shortener) URL in the Wayback Machine than to search through the .xz files for the shortener they want.

@whs
Contributor

whs commented Sep 3, 2014

Rough design idea:

  • Send and store everything in base64 to prevent encoding issues
  • Write a single megawarc for each shortener on export
  • --disable-warc argument to disable export
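
A minimal sketch of how the first and last points might look, assuming a JSON-lines transport. Apart from the `--disable-warc` flag itself, every name here (`build_parser`, `encode_result`, `decode_result`, the `shortcode` and `response` fields) is hypothetical and not taken from the project's code:

```python
import argparse
import base64
import json


def build_parser():
    # Hypothetical exporter CLI; only --disable-warc comes from the design idea.
    parser = argparse.ArgumentParser(description="Export sketch (hypothetical)")
    parser.add_argument(
        "--disable-warc",
        action="store_true",
        help="skip writing WARC output on export",
    )
    return parser


def encode_result(shortcode, raw_response):
    # Base64 keeps arbitrary response bytes safe inside a text format like JSON.
    return json.dumps({
        "shortcode": shortcode,
        "response": base64.b64encode(raw_response).decode("ascii"),
    })


def decode_result(line):
    # Reverse of encode_result: recover the original bytes unchanged.
    item = json.loads(line)
    return item["shortcode"], base64.b64decode(item["response"])
```

Because the bytes are never decoded as text, a response in an unknown or broken character encoding survives the round trip intact.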

Which WARC library should I use? IA's warc seems to be incompatible with Python 3.

@chfoo
Member

chfoo commented Sep 3, 2014

If you want to record WARC files easily, you'll need an agent that supports recording HTTP traffic accurately to WARC files. Example agents include Heritrix, Wget, and Wpull, but these are web crawlers.

If you can get the raw HTTP requests and responses from Python Requests, then you can try to build a WARC file yourself. I wrote a WARC library called Warcat, which supports Python 3. I also wrote Wpull, which runs under Python 3, and maybe you can take code from it.
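
For illustration, building a record by hand could look roughly like the sketch below. It uses only the standard library, not Warcat or Wpull, and the function names `build_warc_record` and `gzip_record` are made up for this example. It assumes you already hold the raw HTTP message as bytes; how faithfully those bytes can be captured from Requests is outside the sketch.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, http_payload, record_type="response", date=None):
    """Return one WARC/1.0 record as bytes (hypothetical helper).

    http_payload is the raw HTTP message: status or request line,
    headers, blank line, and body.
    """
    if date is None:
        date = datetime.now(timezone.utc)
    msgtype = "request" if record_type == "request" else "response"
    headers = [
        "WARC/1.0",
        "WARC-Type: " + record_type,
        "WARC-Record-ID: <urn:uuid:{}>".format(uuid.uuid4()),
        "WARC-Date: " + date.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: " + target_uri,
        "Content-Type: application/http; msgtype=" + msgtype,
        # Content-Length counts the bytes of the record block only.
        "Content-Length: {}".format(len(http_payload)),
    ]
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLF pairs.
    return header_block + http_payload + b"\r\n\r\n"


def gzip_record(record):
    # In a .warc.gz file each record is its own gzip member, so the
    # compressed records can simply be concatenated.
    return gzip.compress(record)
```

Writing a file is then a matter of appending `gzip_record(build_warc_record(...))` for each captured message to a file opened in binary mode. A real export would probably also want a `warcinfo` record and payload digests, which are left out here.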
