Correct Language codes in Gutenberg recipes #217

Open
RavanJAltaie opened this issue Feb 22, 2024 · 5 comments

@RavanJAltaie

RavanJAltaie commented Feb 22, 2024

For Gutenberg, we use the "one-language-one-zim" mode in Zimfarm. In this mode, the language is set automatically by the scraper. The scraper is evidently creating ZIMs with an improper language, so this needs an upstream issue in the Gutenberg scraper; it is nothing you can solve yourself.

There are two issues:

  • openZIM:gutenberg_mul_all is an improper ZIM name: mul is not a valid ISO-639-3 language code
  • openZIM:gutenberg_rmr_all is an improper ZIM name: rmr is no longer a valid ISO-639-3 language code. As of 2010-01-18, [rmr] for Caló is deprecated and split into Caló [rmq] and Erromintxela [emx]

Edit:

  • openZIM:gutenberg_mul_all:
    • ZIM name is OK
    • ZIM filename is OK
    • ZIM language is not OK because mul is not a valid ISO-639-3 language code; it must be a comma-separated list of ISO-639-3 codes sorted by importance (so by number of entries here)
  • openZIM:gutenberg_rmr_all:
    • rmr is no longer a valid ISO-639-3 language code. As of 2010-01-18, [rmr] for Caló is deprecated and split into Caló [rmq] and Erromintxela [emx]
    • ZIM name must be updated (to rmq probably)
    • ZIM filename also
    • ZIM language must be updated as well, could be rmq or rmq,emx
    • might be solved upstream (Gutenberg)
@eshellman
Collaborator

I can see about Caló (it's only one book) from upstream, but none of the others are language codes from PG, that I know of.

@benoit74
Collaborator

Thank you @eshellman, it would be great if you could fix rmr upstream. Otherwise we would have to add a "hack" to our scraper to transform rmr into rmq,emx, since that probably reflects the real situation, or maybe into rmq only.

mul is a hack for the ZIM we create with all languages. To respect the openZIM specification, the scraper should not do that and should list all languages instead. This part is for us ^^
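
For illustration, the rmr "hack" mentioned above could look roughly like the sketch below. This is not the scraper's actual code: the `DEPRECATED_ISO_639_3` table and the `normalize_language_codes` helper are hypothetical names, and mapping rmr to both rmq and emx is only one of the two options discussed here.

```python
# Hypothetical mapping of retired ISO-639-3 codes to their successors.
# Only the rmr entry comes from this thread; mapping it to rmq alone
# would be the other option discussed above.
DEPRECATED_ISO_639_3 = {
    "rmr": ["rmq", "emx"],  # Caló [rmr] was split on 2010-01-18
}


def normalize_language_codes(codes):
    """Replace retired codes with their successors, keeping order and dropping duplicates."""
    normalized = []
    for code in codes:
        # Codes without an entry in the table are passed through unchanged.
        for replacement in DEPRECATED_ISO_639_3.get(code, [code]):
            if replacement not in normalized:
                normalized.append(replacement)
    return normalized


print(",".join(normalize_language_codes(["rmr"])))
```

Keeping the mapping in a small table would make it easy to drop the entry again if rmr gets fixed upstream.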

@rgaudin
Member

rgaudin commented Feb 22, 2024

@benoit74 Languages metadata must be a list of ISO-639-3 codes sorted by importance (so by number of entries here), but the Name metadata and the filename will keep the mul.
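
A minimal sketch of that rule, assuming the scraper has one language code per book at hand. The `build_metadata` helper and its input shape are hypothetical, the `Language` key is my assumption for how the metadata is named, and the alphabetical tie-break is added only to make the output deterministic.

```python
from collections import Counter


def build_metadata(book_languages):
    """Build Name and Language metadata from a list with one ISO-639-3 code per book."""
    counts = Counter(book_languages)
    # Most entries first; ties broken alphabetically (an assumption, for stable output).
    ordered = sorted(counts, key=lambda code: (-counts[code], code))
    # Name keeps "mul" for the all-languages ZIM, per the comment above.
    name_lang = ordered[0] if len(ordered) == 1 else "mul"
    return {
        "Name": f"gutenberg_{name_lang}_all",
        "Language": ",".join(ordered),  # key name is an assumption
    }


print(build_metadata(["eng", "fra", "eng", "deu", "fra", "eng"]))
```

With that input, eng has three entries, fra two and deu one, so the Language value lists them in that order while the Name stays gutenberg_mul_all.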

@benoit74
Collaborator

> Languages metadata must be a list of ISO-639-3 codes sorted by importance (so by number of entries here), but the Name metadata and the filename will keep the mul.

Yep, I had this in mind. Thank you for confirming before I even asked 😄

@benoit74
Collaborator

(and sorry for the wrong description in the first comment, I wrote it too fast)
