Cache subtitles in S3 storage #287

dan-niles · 2024-08-04T07:10:44Z

This PR modifies the download_subtitles method to cache subtitles in S3. The modified method now works as follows:

Fetch requested subtitle keys using yt-dlp (e.g., en, fr, de) and store them in requested_subtitle_keys.
Iterate through the requested_subtitle_keys and attempt to download each subtitle file from the S3 cache. If a file is - successfully downloaded from the S3 cache, remove the corresponding key from requested_subtitle_keys.
For the remaining keys in requested_subtitle_keys, download the subtitles using yt-dlp.
Upload the newly downloaded subtitles to the S3 cache.
Save the information about the fetched subtitles in a local cache as a JSON file for future use.
Add the downloaded subtitles to the ZIM file.

Close #277

codecov · 2024-08-04T07:15:02Z

Codecov Report

Attention: Patch coverage is 0% with 34 lines in your changes missing coverage. Please review.

Project coverage is 1.50%. Comparing base (60b85b5) to head (d34a734).

Files	Patch %	Lines
scraper/src/youtube2zim/scraper.py	0.00%	34 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##            main    #287      +/-   ##
========================================
- Coverage   1.54%   1.50%   -0.05%     
========================================
  Files         11      11              
  Lines       1102    1132      +30     
  Branches     162     170       +8     
========================================
  Hits          17      17              
- Misses      1085    1115      +30

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

benoit74

I'm a bit puzzled about this issue/PR now.

Original goal I expressed in the issue was to avoid being blocked by yt-dlp ban when subtitles did not changed. Goal is not reached in this PR since we still need to get the list of subtitles with yt-dlp.

At the same time, caching as well the list of subtitles is probably not something wishable, we usually do not cache in S3 the responses to API calls, but rather resources which take time/resources to recompute (reencoded videos / images). The idea of caching subtitles in S3 was probably already a deviation from this usual behavior.

To help sort this out, can you please share some data about how this change makes the scraper run faster or not?

dan-niles · 2024-08-05T11:12:02Z

Original goal I expressed in the issue was to avoid being blocked by yt-dlp ban when subtitles did not changed.

The reason I used yt-dlp to get the list of subtitles is that the scraper doesn't know which language subtitles need to be downloaded. Without the --all-subtitles flag, only the default available subtitles for the video are downloaded. With the --all-subtitles flag, auto-generated subtitles are included as well.

To avoid calling yt-dlp entirely in this scenario, we could save two zipped files in the S3 cache:

subtitles/{video_id}/default.zip - containing only the default subtitles.
subtitles/{video_id}/all.zip - containing all subtitles when --all-subtitles is passed.

WDYT?

benoit74 · 2024-08-05T11:38:13Z

This seems a potential improvement, but does it really help the scraper run faster? Because it has the drawback that we do not know when we should invalidate these to update them.

benoit74 · 2024-08-05T12:27:21Z

Let's pause this issue/PR to let me reflect a bit on this

Update download_subtitles method to cache subtitles in S3 storage

d34a734

dan-niles self-assigned this Aug 4, 2024

dan-niles marked this pull request as ready for review August 4, 2024 07:14

dan-niles requested a review from benoit74 August 4, 2024 07:14

benoit74 reviewed Aug 5, 2024

View reviewed changes

dan-niles marked this pull request as draft August 5, 2024 12:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache subtitles in S3 storage #287

Cache subtitles in S3 storage #287

dan-niles commented Aug 4, 2024

codecov bot commented Aug 4, 2024

benoit74 left a comment

dan-niles commented Aug 5, 2024

benoit74 commented Aug 5, 2024

benoit74 commented Aug 5, 2024

Cache subtitles in S3 storage #287

Are you sure you want to change the base?

Cache subtitles in S3 storage #287

Conversation

dan-niles commented Aug 4, 2024

codecov bot commented Aug 4, 2024

Codecov Report

benoit74 left a comment

Choose a reason for hiding this comment

dan-niles commented Aug 5, 2024

benoit74 commented Aug 5, 2024

benoit74 commented Aug 5, 2024