Create WARC files also, besides the XZ files for the urlshorteners, to make the archived url shorteners available through the wayback machine #1

Arkiver2 · 2014-08-31T12:52:32Z

It would be very useful if warc.gz files were also made for the URL shorteners we are archiving.
People are probably more likely to look up a (shortener) URL in the Wayback Machine than to search through the .xz files for the shortener they want.

whs · 2014-09-03T03:34:23Z

Rough design idea:

- Send and store everything in base64 to prevent encoding issues
- Write a single megawarc for each shortener on export
- Add a --disable-warc argument to disable the WARC export

Which WARC library should I use? IA's warc seems to be incompatible with Python 3.
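To make the first and last points of the design idea concrete, here is a minimal sketch using only the standard library. The function names (`encode_field`, `decode_field`, `build_parser`) are hypothetical and not part of the existing codebase; only the `--disable-warc` flag name comes from the list above.

```python
import argparse
import base64


def encode_field(raw: bytes) -> str:
    """Encode raw bytes as ASCII-only base64 text for transport and storage."""
    return base64.b64encode(raw).decode("ascii")


def decode_field(text: str) -> bytes:
    """Recover the original bytes from the base64 text."""
    return base64.b64decode(text.encode("ascii"))


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical export command line with the proposed opt-out flag."""
    parser = argparse.ArgumentParser(description="export sketch (hypothetical)")
    parser.add_argument(
        "--disable-warc",
        action="store_true",
        help="skip writing WARC output during export",
    )
    return parser
```

Because base64 output is plain ASCII, bytes that are not valid UTF-8 (for example a Latin-1 encoded redirect target) survive the round trip unchanged.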

chfoo · 2014-09-03T03:54:29Z

If you want to record as WARC files easily, you'll need an agent that supports recording HTTP traffic accurately to WARC files. Some example agents include Heritrix, Wget, and Wpull, but these are web crawlers.

If you can get the raw HTTP requests and responses from Python Requests, then you can try to build a WARC file yourself. I wrote a WARC library called Warcat, which supports Python 3. I also wrote Wpull, which runs under Python 3, and maybe you can take code from it.
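As a rough illustration of building a record by hand, the sketch below assembles a single WARC 1.0 record from raw HTTP bytes using only the standard library. It does not use Warcat or Wpull, and the helper names (`make_warc_record`, `gzip_record`) are hypothetical. It assumes you already have the raw HTTP message as bytes, and it leaves out things a complete file would normally carry, such as a warcinfo record and payload digests.

```python
import gzip
import uuid
from datetime import datetime, timezone


def make_warc_record(record_type: str, target_uri: str, http_bytes: bytes) -> bytes:
    """Build one WARC 1.0 record ('request' or 'response') around raw HTTP bytes."""
    msgtype = "request" if record_type == "request" else "response"
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: application/http;msgtype={msgtype}",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Header lines end with CRLF; a blank line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs.
    return head + http_bytes + b"\r\n\r\n"


def gzip_record(record: bytes) -> bytes:
    """Compress one record as its own gzip member, as .warc.gz files expect."""
    return gzip.compress(record)
```

Appending the output of `gzip_record` for each record to one file yields a .warc.gz in which every record can be decompressed independently.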

chfoo added the enhancement label Sep 16, 2014
