Cache subtitles in S3 storage #287

Draft
wants to merge 1 commit into base: main
1 change: 1 addition & 0 deletions CHANGELOG
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 - Disable preloading of subtitles in video.js in `zimui` (#38)
+- Update `download_subtitles` method to cache subtitles in S3 storage (#277)
 
 ## [3.0.0] - 2024-07-29
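
The scraper change in this PR derives one cache key and one local file path per video and language code. As a minimal sketch of that naming scheme, assuming standalone helpers (the names `subtitle_s3_key` and `subtitle_path` and the explicit `videos_dir` argument are hypothetical; the PR uses closures over `video_id` and `options` instead):

```python
from pathlib import Path


def subtitle_s3_key(video_id: str, code: str) -> str:
    """Cache key for one subtitle track, using the layout shown in the diff."""
    return f"subtitles/{video_id}/subtitle.{code}.vtt"


def subtitle_path(videos_dir: Path, video_id: str, code: str) -> Path:
    """Local file for the same track (hypothetical standalone helper)."""
    return videos_dir.joinpath(f"{video_id}/video.{code}.vtt")


if __name__ == "__main__":
    # Illustrative values only; "abc123" is not a real video ID.
    print(subtitle_s3_key("abc123", "en"))
    print(subtitle_path(Path("videos"), "abc123", "fr"))
```

Keeping the key a pure function of the video ID and language code means a cache lookup needs no extra metadata, at the cost of never noticing when a track changes upstream.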

59 changes: 54 additions & 5 deletions scraper/src/youtube2zim/scraper.py
@@ -882,21 +882,70 @@
     def download_subtitles(self, video_id, options):
         """download subtitles for a video"""
 
+        def get_subtitle_s3_key(code: str) -> str:
+            return f"subtitles/{video_id}/subtitle.{code}.vtt"
+
+        def get_subtitle_path(code: str) -> str:
+            return options["y2z_videos_dir"].joinpath(f"{video_id}/video.{code}.vtt")
+
         options_copy = options.copy()
-        options_copy.update({"skip_download": True, "writethumbnail": False})
+        options_copy.update(
+            {"skip_download": True, "writethumbnail": False, "listsubs": True}
+        )
+
+        # Fetch the list of requested subtitles
         try:
             with yt_dlp.YoutubeDL(options_copy) as ydl:
-                ydl.download([video_id])
+                info = ydl.extract_info(video_id, download=False)
+                requested_subtitles = (
+                    info.get("requested_subtitles", {}) if info else None
+                )
+                if not requested_subtitles:
+                    return True
+                requested_subtitle_keys = list(requested_subtitles.keys())
+        except Exception as e:
+            logger.error(f"Could not fetch subtitles for {video_id}: {e}")
+            return False
+
+        # Download subtitles from cache if available
+        if self.s3_storage:
+            for subtitle_key in requested_subtitles:
+                subtitle_path = get_subtitle_path(subtitle_key)
+                s3_key = get_subtitle_s3_key(subtitle_key)
+                logger.debug(
+                    f"Attempting to download subtitles for {video_id} from cache..."
+                )
+                if self.download_from_cache(s3_key, subtitle_path, ""):
+                    requested_subtitle_keys.remove(subtitle_key)
+
+        # Download subtitles using yt-dlp
+        try:
+            if len(requested_subtitle_keys) > 0:
+                options_copy.update(
+                    {"sublangs": requested_subtitle_keys, "listsubs": False}
+                )
+                with yt_dlp.YoutubeDL(options_copy) as ydl:
+                    ydl.download([video_id])
+        except Exception:
+            logger.error(f"Could not download subtitles for {video_id}")
+        else:
+            # upload to cache only if everything went well
+            if self.s3_storage:
+                for subtitle_key in requested_subtitle_keys:
+                    subtitle_path = get_subtitle_path(subtitle_key)
+                    s3_key = get_subtitle_s3_key(subtitle_key)
+                    logger.debug(f"Uploading subtitle for {video_id} to cache ...")
+                    self.upload_to_cache(s3_key, subtitle_path, "")
+
+            # save subtitle keys to local cache for generating JSON files later
             subtitles_list = self.fetch_video_subtitles_list(video_id)
-            # save subtitles to cache for generating JSON files later
             save_json(
                 self.subtitles_cache_dir,
                 video_id,
                 subtitles_list.dict(by_alias=True),
             )
             self.add_video_subtitles_to_zim(video_id)
-        except Exception:
-            logger.error(f"Could not download subtitles for {video_id}")
+        return True
 
     def download_video_files_batch(self, options, videos_ids):
         """download video file and thumbnail for all videos in batch

Codecov / codecov/patch warnings on scraper/src/youtube2zim/scraper.py: added lines #L885-L886, #L888-L889, #L892, #L899-L900, #L904-L908, #L913-L915, #L919, #L922, #L924, #L928-L930, #L935-L938, and #L948 were not covered by tests.
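
The control flow in `download_subtitles` follows a read-through cache pattern: take what the cache already holds, download only the misses, and upload the fresh downloads only if the download succeeded. The sketch below isolates that pattern so it can be exercised without S3 or yt-dlp, which may help with the uncovered lines. Everything in it is an assumption for illustration: `sync_subtitles`, `key_for`, and `download` are hypothetical names, and a plain dict stands in for the S3 bucket.

```python
from typing import Callable, Dict, List, Tuple


def sync_subtitles(
    requested: List[str],
    cache: Dict[str, str],
    download: Callable[[List[str]], Dict[str, str]],
    key_for: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (subtitles by language code, codes that had to be downloaded)."""
    subtitles: Dict[str, str] = {}
    missing: List[str] = []

    # 1. Serve whatever the cache (a dict standing in for S3) already holds.
    for code in requested:
        cached = cache.get(key_for(code))
        if cached is not None:
            subtitles[code] = cached
        else:
            missing.append(code)

    # 2. Download only the cache misses, in a single call.
    if missing:
        # If download() raises, nothing below runs, so a failed
        # download never populates the cache.
        fetched = download(missing)
        subtitles.update(fetched)
        # 3. Upload fresh downloads so that a later run can skip them.
        for code, content in fetched.items():
            cache[key_for(code)] = content

    return subtitles, missing
```

A unit test can pass a fake `download` that records its arguments and then assert that a second call with the same languages triggers no download at all.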