Videos on Twitter captures #198

Open
laurelin88 opened this issue May 20, 2020 · 6 comments

Comments

@laurelin88

Hi,

Before I describe the issue, I will preface by saying I am very new to brozzler and similar tools in general, so perhaps my question is a little bit simplistic.
Anyways, I was wondering if you have any pointers as to why videos on some hashtag feeds I captured do not seem to play when I view them on pywb. Is there any configuration I could make to possibly solve this, for Twitter or for other social media platforms/websites?
Thank you!

@galgeek
Contributor

galgeek commented May 20, 2020

brozzler depends on youtube-dl for much video capture, so make sure that your youtube-dl install is up to date (it's updated pretty frequently). You can update your brozzler virtualenv with pip install -U youtube-dl.

twitter has recently updated video and hashtag code, and we're actively working on improving capture.

@laurelin88
Author

Thank you for your response and apologies for my late reply - I did indeed update youtube-dl but the problem seems to persist.
In fact, after checking the WARC file with the ArchiveTools warc-extractor, it turns out that the file dump from the WARC does contain TS video files that are accessible, but they are not replayable from inside the WARC itself, e.g. with pywb.

@galgeek
Contributor

galgeek commented Jun 17, 2020

What's the twitter url you're trying to capture?

@anjackson

I'll gladly be corrected on this, but AFAIK right now there is no openly-available web archive playback system that can play the videos captured in this way.

The pywb stack massages the messages between the client and server so that playback works without additional metadata records. The approach used here (capturing the videos with youtube-dl and storing a JSON metadata record that links the source page to the videos) requires an additional step which is only just now being finalised in pywb, and will require a little more work at the indexing stage to make it work (mapping metadata:... records to urn:embeds:...). /cc @ikreymer

@galgeek
Contributor

galgeek commented Jun 18, 2020

brozzler with youtube-dl currently captures mp4s for at least some twitter video. (Whether and when youtube-dl captures video from a site can depend on the format of the upload, as well as the site's video hosting pipeline.) Here's one that I worked on recently:
https://wayback.qa-archive-it.org/12058/20200601214942/https://video.twimg.com/ext_tw_video/1056575394453839872/pu/vid/1280x720/dhWsaVXAvomMyBG-.mp4?tag=5

brozzler often directly captures the initial segments of media that's delivered in segments, which may be how @laurelin88's TS video files were captured. It's true that these are a challenge to replay.

@ikreymer

ikreymer commented Jun 19, 2020

Yes, as @anjackson mentions, with the youtube-dl approach it would be possible to read the youtube-dl JSON to determine the URLs of the videos downloaded via youtube-dl. pywb now supports urn:embeds: for a more generic JSON embeds format, while brozzler saves the youtube-dl files as youtube-dl:<id>:<url>.
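
To make the youtube-dl:<id>:<url> naming concrete, here is a minimal Python sketch of how an indexer might split such a record URI back into its parts and group records by source page. The function names and the sample URIs are hypothetical, and the exact form of <id> is an assumption; this is not code from brozzler or pywb.

```python
from collections import defaultdict


def parse_youtube_dl_record_uri(uri):
    """Split 'youtube-dl:<id>:<url>' into (id, url); return None if it does not match.

    Assumption: <id> contains no colon, so the first colon after the
    prefix separates the id from the page URL.
    """
    prefix = "youtube-dl:"
    if not uri.startswith(prefix):
        return None
    record_id, sep, page_url = uri[len(prefix):].partition(":")
    if not sep or not record_id or not page_url:
        return None
    return record_id, page_url


def index_by_page(record_uris):
    """Group matching record URIs by the page URL they refer to."""
    index = defaultdict(list)
    for uri in record_uris:
        parsed = parse_youtube_dl_record_uri(uri)
        if parsed is not None:
            index[parsed[1]].append(uri)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical record URIs, for illustration only.
    sample = [
        "youtube-dl:00001:https://example.com/status/1",
        "https://example.com/other",
    ]
    print(index_by_page(sample))
```

Even with such a lookup, the replay side still has to decide where each video belongs on the page.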

It would be possible to support youtube-dl:... lookup in pywb as well, but pywb would still need to determine where the video should go on the page. If there is more than one video, it may not be possible to guess, so playback of video may not always work.

Fortunately, there is also an alternative solution that pywb has supported for a while. It works with HTML5 video and does not involve youtube-dl at all.

When encountering an HLS or DASH manifest, it is possible to rewrite it at capture time so that only one resolution is available. For example, given an HLS manifest (.m3u8) file that looks like:

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=610000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_1.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=416000,RESOLUTION=400x224,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_2.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=797000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_3.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1002000,RESOLUTION=640x360,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_4.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=4495000,RESOLUTION=1920x1080,CODECS="avc1.640028, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_6.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=38000,CODECS="mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/audio_0.m3u8

The rewriting simply removes all resolutions except the one to be captured. The desired resolution can be the highest one up to a maximum, so if 1920x1080 is too much, the second-highest one is chosen.
The file served to the browser then looks like this (while the original is still written to the WARC):

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8
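
The filtering above can be sketched in a few lines of Python. This is only an illustration of the idea under some assumptions, not pywb's actual implementation: the name filter_hls_master and the max_height parameter are made up here, and the fallback behaviour when no variant fits under the limit is my own choice.

```python
import re

RESOLUTION_RE = re.compile(r"RESOLUTION=(\d+)x(\d+)")
# Require ':' or ',' before BANDWIDTH so AVERAGE-BANDWIDTH is not matched.
BANDWIDTH_RE = re.compile(r"[:,]BANDWIDTH=(\d+)")


def filter_hls_master(text, max_height=720):
    """Return a master playlist that keeps a single variant stream.

    The variant kept is the tallest one whose height is <= max_height
    (ties broken by bandwidth). All other lines are passed through.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = []    # lines kept as-is (#EXTM3U, #EXT-X-MEDIA, ...)
    variants = []  # (height, bandwidth, stream_inf_line, uri_line)
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("#EXT-X-STREAM-INF:") and i + 1 < len(lines):
            res = RESOLUTION_RE.search(line)
            bw = BANDWIDTH_RE.search(line)
            height = int(res.group(2)) if res else 0  # 0 means no video resolution
            bandwidth = int(bw.group(1)) if bw else 0
            variants.append((height, bandwidth, line, lines[i + 1]))
            i += 2
        else:
            header.append(line)
            i += 1
    if not variants:
        return text  # not a master playlist; leave it untouched

    video = [v for v in variants if v[0] > 0]
    within = [v for v in video if v[0] <= max_height]
    if within:
        chosen = max(within, key=lambda v: (v[0], v[1]))
    elif video:
        # Assumption: if everything exceeds the limit, take the smallest video.
        chosen = min(video, key=lambda v: (v[0], v[1]))
    else:
        chosen = max(variants, key=lambda v: v[1])  # audio-only playlist
    return "\n".join(header + [chosen[2], chosen[3]]) + "\n"
```

Applied to the first manifest above with max_height=720, this keeps the #EXT-X-MEDIA line and the 1280x720 variant only.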

Then, on replay, if the same rewriting is applied, all the chunks of the video will play back, as only one resolution is available.
This makes videos on Twitter (and many other sites that use HLS) work. DASH is a similar XML-based format that allows for this type of filtering.
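
For DASH, the same idea might look roughly like the sketch below, using only the Python standard library. It is a simplified illustration, not pywb's code: the function name filter_dash_mpd and the max_height parameter are hypothetical, and it assumes that the height attribute sits on each Representation element (some manifests put it on the AdaptationSet instead, which this sketch leaves untouched).

```python
import xml.etree.ElementTree as ET

MPD_NS = "urn:mpeg:dash:schema:mpd:2011"


def filter_dash_mpd(xml_text, max_height=720):
    """Keep one video Representation per AdaptationSet in a DASH MPD."""
    ET.register_namespace("", MPD_NS)  # serialize without an 'ns0:' prefix
    root = ET.fromstring(xml_text)
    rep_tag = "{%s}Representation" % MPD_NS
    aset_tag = "{%s}AdaptationSet" % MPD_NS

    def sort_key(rep):
        return (int(rep.get("height")), int(rep.get("bandwidth", "0")))

    # Materialize the list first so the tree is not modified while iterating.
    for aset in list(root.iter(aset_tag)):
        # Only representations that declare a height (i.e. video) are filtered.
        reps = [r for r in aset.findall(rep_tag) if r.get("height")]
        if len(reps) < 2:
            continue
        within = [r for r in reps if int(r.get("height")) <= max_height]
        # Assumption: if nothing fits under the limit, keep the smallest one.
        chosen = max(within, key=sort_key) if within else min(reps, key=sort_key)
        for rep in reps:
            if rep is not chosen:
                aset.remove(rep)
    return ET.tostring(root, encoding="unicode")
```

Audio-only adaptation sets have no height on their representations, so they pass through unchanged.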

imo this little bit of rewriting/filtering at capture time is a useful tradeoff, as it avoids all the complexity of youtube-dl, the extra index, and replay-time video index mapping, and results in working video replay. The main downside is that the video is archived in chunks, rather than as one record/stream.
