New request: (A few playlists) of Youtube Channel "C'est pas sorcier" #1182

Open
kelson42 opened this issue Oct 12, 2024 · 14 comments

@kelson42
Collaborator

The full channel is already available, but it is huge. I would like to be able to distribute only part of the channel, and having only a few playlists seems to be the easiest way to do that.

@kelson42 kelson42 self-assigned this Oct 12, 2024
@kelson42 kelson42 added the Youtube for zim requests that are Youtube-related label Oct 12, 2024
@kelson42
Collaborator Author

@benoit74 I wanted to scrape https://www.youtube.com/watch?v=Ofa1OV6d5xc&list=PLh-qVJTuss13TJpf3Fdd8cbLaPbT18a9Z at https://farm.openzim.org/recipes/cest-pas-sorcier_fr_astronomie and I failed twice. It seems I am not able to find or configure the playlist ID properly. The FAQ also does not explain how to retrieve the ID in the case of a playlist. Have I just done something wrong, or is something unclear or maybe even buggy?

@benoit74
Contributor

> The FAQ also does not explain how to retrieve the ID in the case of a playlist.

See https://github.com/openzim/youtube/wiki/FAQ---FEE#how-do-i-find-a-channel--user--handle-technical-id (I promise it has been there since Friday, and it was even there before, though a bit harder to find, I have to admit)

> Have I just done something wrong, or is something unclear or maybe even buggy?

In https://farm.openzim.org/pipeline/0409f83a-76b0-476e-a552-971fbdf79e97/debug you should have set "type" to "playlist". I fixed this and requested the recipe again. I also opened openzim/youtube#361

@benoit74
Contributor

This is now mostly working, but a new bug has appeared in the Youtube scraper: openzim/youtube#362

@benoit74
Contributor

We have a working ZIM, as can be seen at https://dev.library.kiwix.org/#lang=&tag=&q=Magazine (built with a dev version of the Youtube scraper; we should probably wait for the release before pushing to prod).

Note that this ZIM is impacted by a bug at the library / library generation / scraper level around tags: kiwix/operations#286

@benoit74
Contributor

@kelson42 Shall I let you continue with the other playlists you wanted to create?

@kelson42
Collaborator Author

> @kelson42 Shall I let you continue with the other playlists you wanted to create?

May I wait for openzim/youtube#369 to be fixed? Or would you not recommend that?

@kelson42
Collaborator Author

kelson42 commented Oct 19, 2024

> > The FAQ also does not explain how to retrieve the ID in the case of a playlist.

> See https://github.com/openzim/youtube/wiki/FAQ---FEE#how-do-i-find-a-channel--user--handle-technical-id (I promise it has been there since Friday, and it was even there before, though a bit harder to find, I have to admit)

@benoit74 Do you mean https://github.com/openzim/youtube/wiki/Frequently-Asked-Questions#how-do-i-find-a-channel--user--handle-technical-id ? This does not mention "playlist"... and if it is meant to cover playlists, that is unclear from a user's perspective.

On top of that, if I look at the source of https://www.youtube.com/playlist?list=PLh-qVJTuss13TJpf3Fdd8cbLaPbT18a9Z, there is nothing like this:

$ curl -s "https://www.youtube.com/playlist?list=PLh-qVJTuss13TJpf3Fdd8cbLaPbT18a9Z" | grep 'itemprop="identifier"'
$ echo $?
1

@kelson42
Collaborator Author

> > @kelson42 Shall I let you continue with the other playlists you wanted to create?

> May I wait for openzim/youtube#369 to be fixed? Or would you not recommend that?

I'm halfway (50%) through all the playlists.

@benoit74
Contributor

I've added https://github.com/openzim/youtube/wiki/Frequently-Asked-Questions#how-do-i-find-a-playlist-id for playlists. I didn't realize at the time that you were talking about playlists; I read your comment too fast, sorry.
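For the two URLs mentioned earlier in this thread, the playlist ID is the value of the `list` query parameter, so it can be read straight from the URL rather than from the page source. A minimal Python sketch (the helper name `playlist_id_from_url` is hypothetical, not part of the scraper):

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def playlist_id_from_url(url: str) -> Optional[str]:
    """Return the value of the `list` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


# Both URL forms quoted in this issue carry the same playlist ID.
watch_url = "https://www.youtube.com/watch?v=Ofa1OV6d5xc&list=PLh-qVJTuss13TJpf3Fdd8cbLaPbT18a9Z"
playlist_url = "https://www.youtube.com/playlist?list=PLh-qVJTuss13TJpf3Fdd8cbLaPbT18a9Z"

print(playlist_id_from_url(watch_url))
print(playlist_id_from_url(playlist_url))
```

The returned value is what would go in the recipe as the ID, together with "type" set to "playlist".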

> I'm halfway (50%) through all the playlists.

Cool

> May I wait for openzim/youtube#369 to be fixed? Or would you not recommend that?

Sure. It will be done sometime this week at the latest, but setting up the recipes and checking that they work as expected is not a waste of Zimfarm time, especially since the Zimfarm is mostly empty ATM.

@kelson42
Collaborator Author

kelson42 commented Oct 26, 2024

@benoit74 I have created all the recipes (one per playlist). They are ready for review. See https://farm.openzim.org/recipes?name=sorcier

@benoit74
Contributor

I don't get the reasoning behind the title / description / long description metadata. They seem to follow a pattern more or less, but they still differ quite a bit from one ZIM to another, and are sometimes even a bit inconsistent (or I am missing the logic, e.g. why Geologie is pretty much the only ZIM with a nice / funny description).

Can you explain how you've reasoned about it?

And more precisely, why do you consider that:

  • using the same very generic description is OK? ("Magazine télévisuel de vulgarisation scientifique destiné aux enfants" is used over and over)
  • using only a partial description of C'est pas sorcier is OK? I understand it is meant as a complement to the title, but are we sure all readers display the title and the description in the right layout for the combination to make sense?
  • relying heavily on the long description to provide meaning is OK? (it is not even displayed yet in our most official reader, kiwix-serve)

Something like this card makes me think something is broken or at least very odd (an upper-case letter in the middle of a string, with no upper-case letter at the beginning):

Image

And finally, I only get 17 ZIMs (instead of 20) with https://dev.library.kiwix.org/#lang=&q=sorcier, and I don't get why.

From my perspective so far, title and description are meant to:

  • describe what the ZIM is about, especially the content creator / source (C'est pas sorcier)
  • clearly differentiate the various ZIMs we might have from the same content provider
  • both be individually self-contained (i.e. if one skips the title and reads only the description, one still understands what one will get)
  • be clear descriptions of what is inside the ZIM (here I miss the fact that these are selections / subsets of a few videos of the TV show around a particular topic)

This is at least what guided us in https://library.kiwix.org/#lang=eng&q=ted (where I have to admit we've been lucky to have "TED" and nothing longer). Here I feel like we are realizing that the 30- and 80-character limits are too short and we are fighting against them with (slightly ugly) hacks.
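To see quickly which candidate titles and descriptions run into those limits, something like the following Python sketch could help. It assumes 30 applies to the title and 80 to the description, and that the limits can be approximated with `len()` (code points); the real check may count characters differently, and the helper name `over_limit` is hypothetical.

```python
TITLE_MAX = 30        # assumed title limit, from the "30 and 80 chars" above
DESCRIPTION_MAX = 80  # assumed description limit


def over_limit(title: str, description: str) -> list:
    """Return one human-readable message per field exceeding its limit."""
    problems = []
    if len(title) > TITLE_MAX:
        problems.append(f"title: {len(title)} > {TITLE_MAX}")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description: {len(description)} > {DESCRIPTION_MAX}")
    return problems


# An empty list means both fields fit within the assumed limits.
print(over_limit("C'est pas sorcier", "Magazine télévisuel de vulgarisation scientifique destiné aux enfants"))
```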

@kelson42
Collaborator Author

> And finally, I only get 17 ZIMs (instead of 20) with https://dev.library.kiwix.org/#lang=&q=sorcier, and I don't get why.

My best guess: this is a consequence of having Zimfarm recipes with the same ZIM metadata "Name". At least, that was the case at some point. Somehow this has an impact on the script that builds the library of dev.library.kiwix.org.

This is why I called the impact "vicious" and asked for quick action on this.
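If the guess about shared "Name" values is right, the colliding recipes could be spotted with a quick check like this Python sketch. The recipe list, its keys and all values are made up for illustration; this is not how the Zimfarm or the library script actually stores or reads them.

```python
from collections import Counter


def duplicate_names(recipes: list) -> list:
    """Return the sorted "Name" values used by more than one recipe."""
    counts = Counter(recipe["name"] for recipe in recipes)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical data: two recipes accidentally share the same Name.
recipes = [
    {"recipe": "example_fr_topic-a", "name": "example_fr_topic-a"},
    {"recipe": "example_fr_topic-b", "name": "example_fr_topic-a"},
    {"recipe": "example_fr_topic-c", "name": "example_fr_topic-c"},
]

print(duplicate_names(recipes))
```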

@kelson42
Collaborator Author

kelson42 commented Nov 2, 2024

@benoit74 I have modified/fixed all the recipes according to your remarks and our discussion. Unfortunately, it seems the latest version of the Youtube scraper fails! None of them seems to pass anymore.

@benoit74
Contributor

benoit74 commented Nov 3, 2024

As discussed on Slack, this has nothing to do with the Youtube scraper; it is a Zimfarm bug: openzim/zimfarm#668

Please replace " quotes with « and » so that you don't have to wait for the bug to be fixed.
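For many recipes, that replacement could be scripted. A minimal Python sketch, assuming the straight double quotes in each metadata string are balanced and not nested (the function name `to_guillemets` is hypothetical):

```python
def to_guillemets(text: str) -> str:
    """Replace straight double quotes with alternating « and »."""
    out = []
    opening = True  # the next straight quote found opens a quotation
    for ch in text:
        if ch == '"':
            out.append("«" if opening else "»")
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)


print(to_guillemets('La "Terre" et la "Lune"'))
```

No spaces are added inside the guillemets, so the string length stays the same, which matters given the title and description limits.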
