Videos on Twitter captures #198

laurelin88 · 2020-05-20T14:30:25Z

Hi,

Before I describe the issue, I will preface by saying I am very new to brozzler and similar tools in general, so perhaps my question is a little bit simplistic.
Anyways, I was wondering if you have any pointers as to why videos on some hashtag feeds I captured do not seem to play when I view them on pywb. Is there any configuration I could make to possibly solve this, for Twitter or for other social media platforms/websites?
Thank you!

galgeek · 2020-05-20T17:04:52Z

brozzler depends on youtube-dl for much video capture, so make sure that your youtube-dl install is up to date (it's updated pretty frequently). You can update your brozzler virtualenv with pip install -U youtube-dl.

twitter has recently updated video and hashtag code, and we're actively working on improving capture.

laurelin88 · 2020-06-17T15:40:14Z

Thank you for your response and apologies for my late reply - I did indeed update youtube-dl but the problem seems to persist.
In fact, after checking the WARC file with the ArchiveTools warc-extractor, it turns out that the file dump from the WARC does contain TS video files that are accessible, but they are not replayable from inside the WARC itself, e.g. with pywb.

galgeek · 2020-06-17T22:10:23Z

What's the twitter url you're trying to capture?

anjackson · 2020-06-18T07:09:17Z

I'll gladly be corrected on this, but AFAIK right now there is no openly-available web archive playback system that can play the videos captured in this way.

The pywb stack massages the messages between the client and server so that playback works without additional metadata records. The approach used here (capturing the videos with youtube-dl and storing a JSON metadata record that links the source page to the videos) requires an additional step which is only just now being finalised in pywb, and will require a little more work at the indexing stage to make it work (mapping metadata:... records to urn:embeds:...). /cc @ikreymer

galgeek · 2020-06-18T18:25:30Z

brozzler with youtube-dl currently captures mp4s for at least some twitter video. (Whether and when youtube-dl captures video from a site can depend on the format of the upload, as well as the site's video hosting pipeline.) Here's one that I worked on recently:
https://wayback.qa-archive-it.org/12058/20200601214942/https://video.twimg.com/ext_tw_video/1056575394453839872/pu/vid/1280x720/dhWsaVXAvomMyBG-.mp4?tag=5

brozzler often directly captures the initial segments of media that's delivered in segments, which may be how @laurelin88's TS video files were captured. It's true that these are a challenge to replay.

ikreymer · 2020-06-19T05:15:58Z

Yes, as @anjackson mentions, with the youtube-dl approach, it would be possible to read the youtube-dl JSON to determine the URLs of videos downloaded via youtube-dl. pywb now supports urn:embeds: for a more generic JSON embeds format, while brozzler saves the youtube-dl files as youtube-dl:<id>:<url>.

It would also be possible to support youtube-dl:... lookup in pywb, but pywb would still need to determine where each video should go on the page. If there is more than one video, it may not be possible to guess, so video playback may not always work.
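
To illustrate the first step (reading the youtube-dl JSON to find the video URLs), here is a minimal sketch. It assumes the record payload is a standard youtube-dl info dict with "url", "formats" and (for playlists) "entries" keys; the function names are hypothetical and this is not how brozzler or pywb actually do it.

```python
import json


def video_urls(info):
    """Collect media URLs from a youtube-dl info dict, recursing into playlists."""
    urls = []
    # Playlist-style results nest one info dict per video under "entries".
    for entry in info.get("entries") or []:
        if entry:
            urls.extend(video_urls(entry))
    # The format youtube-dl selected for download, if any.
    if info.get("url"):
        urls.append(info["url"])
    # Every format that was on offer; skip URLs already collected.
    for fmt in info.get("formats") or []:
        url = fmt.get("url")
        if url and url not in urls:
            urls.append(url)
    return urls


def urls_from_record(payload):
    """payload: the JSON body of a youtube-dl metadata record (assumed layout)."""
    return video_urls(json.loads(payload))
```

This only answers "which URLs belong to this page"; it does not solve the placement problem described above.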

Fortunately, there is also an alternative solution that pywb has supported for a while. It works with HTML5 video and does not involve youtube-dl at all.

When encountering an HLS or DASH manifest, it is possible to rewrite it at capture time so that only one resolution is available. For example, given an HLS manifest (.m3u8) file that looks like:

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=610000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_1.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=416000,RESOLUTION=400x224,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_2.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=797000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_3.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1002000,RESOLUTION=640x360,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_4.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=4495000,RESOLUTION=1920x1080,CODECS="avc1.640028, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_6.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=38000,CODECS="mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/audio_0.m3u8

The rewriting simply removes all resolutions except the one to be captured. The desired resolution can be the highest one up to a maximum, so if 1920x1080 is too much, the second-highest one is chosen.
The file served to the browser then looks like this (while the original is still written to the WARC):

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8
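
As a rough sketch of this kind of filtering (not pywb's actual implementation), the following keeps the tallest variant at or below a maximum height and drops every other #EXT-X-STREAM-INF entry, including audio-only ones. The function name and the max_height parameter are made up for illustration, and it assumes each stream's URI is on the line right after its #EXT-X-STREAM-INF tag.

```python
import re

RESOLUTION_RE = re.compile(r"RESOLUTION=(\d+)x(\d+)")
BANDWIDTH_RE = re.compile(r"BANDWIDTH=(\d+)")
STREAM_TAG = "#EXT-X-STREAM-INF:"


def filter_master_playlist(text, max_height=720):
    """Return the playlist with a single variant stream left in it."""
    lines = text.splitlines()
    variants = []  # (height, bandwidth, index of the STREAM-INF line)
    for i, line in enumerate(lines):
        if not line.startswith(STREAM_TAG):
            continue
        res = RESOLUTION_RE.search(line)
        if res is None:
            continue  # no RESOLUTION (e.g. audio-only): never a candidate
        bw = BANDWIDTH_RE.search(line)
        variants.append((int(res.group(2)), int(bw.group(1)) if bw else 0, i))
    if not variants:
        return text  # not a master playlist, leave untouched

    # Highest resolution up to the maximum; fall back to the smallest one.
    eligible = [v for v in variants if v[0] <= max_height] or [min(variants)]
    keep = max(eligible)[2]

    out = []
    drop_next_uri = False
    for i, line in enumerate(lines):
        if line.startswith(STREAM_TAG):
            drop_next_uri = i != keep
            if not drop_next_uri:
                out.append(line)
            continue
        if drop_next_uri and line and not line.startswith("#"):
            drop_next_uri = False  # this was the URI of a removed variant
            continue
        out.append(line)
    return "\n".join(out) + "\n"
```

Called on the full manifest above with max_height=720, this should leave only the 1280x720 entry, as in the rewritten example.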

Then, on replay, if the same rewriting is applied, all the chunks of the video will play back, as only one resolution is available.
This makes videos on Twitter (and many other sites that use HLS) work. DASH is a similar XML-based format that allows for this type of filtering.
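
For DASH, a comparable sketch (again an illustration, not pywb's code) would prune the Representation elements of each video AdaptationSet in the MPD. It assumes the standard urn:mpeg:dash:schema:mpd:2011 namespace and that height is given on each Representation; the function name and max_height are hypothetical. Adaptation sets without heights (typically audio) are left alone.

```python
import xml.etree.ElementTree as ET

MPD_NS = "urn:mpeg:dash:schema:mpd:2011"


def filter_mpd(xml_text, max_height=720):
    """Keep one Representation per video AdaptationSet."""
    # Serialize with the MPD namespace as the default (no ns0: prefixes).
    ET.register_namespace("", MPD_NS)
    root = ET.fromstring(xml_text)

    def size(rep):
        return (int(rep.get("height")), int(rep.get("bandwidth", "0")))

    # list() so the tree is not modified while it is being walked.
    for aset in list(root.iter("{%s}AdaptationSet" % MPD_NS)):
        reps = [r for r in aset.findall("{%s}Representation" % MPD_NS)
                if r.get("height")]
        if len(reps) < 2:
            continue  # nothing to choose between
        # Highest resolution up to the maximum; fall back to the smallest.
        eligible = [r for r in reps if int(r.get("height")) <= max_height]
        keep = max(eligible or [min(reps, key=size)], key=size)
        for rep in reps:
            if rep is not keep:
                aset.remove(rep)
    return ET.tostring(root, encoding="unicode")
```

As with HLS, the same filter would have to be applied again at replay time so the player requests the segments that were actually captured.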

imo this little bit of rewriting/filtering at capture time is a useful tradeoff, as it avoids all the complexity of youtube-dl, an extra index, and replay-time video index mapping, and results in working video replay. The main downside is that the video is archived in chunks, rather than as one record/stream.

ikreymer mentioned this issue Jun 10, 2020

JavaScript files harvested as partial content (HTTP 206) break playback #201

Open

nvanderperren mentioned this issue Nov 17, 2020

Images on Instagram and Twitter captures not shown in pywb #215

Open
