-
-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use libzim for reading ZIM contents #1160
Conversation
@Jaifroid Basic UI is done. What I wanted to know is whether there is any implementation for libzim or do I start from scratch (except for other PR). Also what functions should I implement first? |
For now we can only implement the functions that are currently exposed in the libzim bindings over at javascript/libzim, and for all intents and purposes, I believe those are full-text search (already implemented in our production code, so there is nothing to do to that at this stage), retrieving an article by path, and retrieving assets by path or URL. All this is enough to be able to read and browse the contents of a ZIM. It would be a good idea to allow developers to load (from the interface), any one of the four implementations of libzim that we have, as each will be helpful in analysing and decoding issues:
You can see an example of the kind of choice I'm suggesting over at the PWA (https://pwa.kiwix.org), where I included such a dropdown under Troubleshooting and Development in Configuration. However, that dropdown only chooses the implementation for full-text searching, it doesn't allow the WASM to be loaded for article retrieval. Your implementation should not be about search, but should instead be about using the WASM for reading articles. Needless to say, you mustn't load the WASM twice, in other words, your dropdown should control the EXISTING loading of the WASM in You can start with Service Worker mode (because that's easiest), but there is no reason why libzim can't be used in jQuery mode too, and once you've understood the logic in SW mode it should be easy to use the same calls in jQuery mode. I'm not sure I've expressed myself clearly. Let me know if I can clarify anything. |
By the way, I think the title of this PR isn't very clear. We already use WASM (and ASM) for XZ and ZSTD. This PR is about libzim, not whether it is ASM or WASM. I suggest changing the title to something like "Use libzim for reading ZIM contents" (or something similar)! |
@Jaifroid I was working on |
@Rishabhg71 You have to look at the libzim source itself here: https://github.com/openzim/libzim/tree/main/src . Basically, the bindings try to expose in JS the functions that we need from libzim. As well as the source code, there is this: https://libzim.readthedocs.io/en/stable/ Hopefully some of that helps! However, remember we've already got the most useful bindings exposed, so you don't need to replicate those. It's great if you can get your head round them, though, as at least one of them doesn't seem to have proper error catching. |
@Jaifroid I am still confused as I was before. I will try explaining with an example like there is a function called |
The PWA is an easy way to see all the URL paths in a ZIM. Open a ZIM, and in the search bar, type space + / (space followed by forward slash). That will evoke the URL index, and then you can clearly see what kind of ZIM you're dealing with, a Type 0 (articles in A/ namespace), a Type 1 (all user content in C/ namespace) or a hybrid Zimit type (with namespaces that look like C/A(... for user content, and C/H/... for response headers). If you are interested in a particular article, you can type part of its path into the search box, like As far as I know, the "path" is in fact the ZIM URL as shown in the image below. Try it and see. Note that in Wikimedia ZIMs, it's still A/..., whereas in most other newer ZIMs it should be C/... |
And did you try out the libzim test implementation linked from here: https://github.com/openzim/javascript-libzim? That's a really basic implementation that allows you to see clearly how the calls are done to the WASM version (by examining in DEV tools). It's kind of tuned for using a Ray Charles (modern) ZIM, but you can use any so long as you alter the paths. Beware that some ZIMs currently don't work for article loading due to the lack of error-handling I've mentioned before (one thing we need to fix, see openzim/javascript-libzim#64). |
dc8a27b
to
1cfd253
Compare
@Jaifroid It seems like after the latest update WASM files are not exporting the functions at all. Reverting back to this commit is working wasm is working |
OK, I'll check. I updated this branch from main, but didn't test properly, as it's an experimental branch. |
Oh, that's simply because this branch needs to be fixed as I outlined in #1169. As I wrote to you there, you need to call |
2d86213
to
e1c8733
Compare
Thanks to your advice I have a working prototype that I can now build upon. The first issue I have encountered here is only certain Zim files are being loaded properly as you can see in the video below. It seems like something related to namespace working Gutenberg has a namespace of 20231122-1918-05.5203601.mp4 |
Good news that it's (partially) working! I don't think that's an issue with the ZIMs. All ZIMs have a W namespace entry, standing for "Well-known entry", aka, pointer to the landing page. That pointer is read when the ZIM is first opened, to determine the location of the landing page directory entry. That should redirect (automatically) to something like You shouldn't have to handle any of this. It should all be transparent. It's puzzling indeed that it works for one case, but not for another. Could it be that in one case, our backend is resolving You could pause on that function in zimArchive.js to see if it's being called in some cases, but not in others, and then work back from there. |
The issue could be much simpler, actually. I just checked the gutenberg ro archive in the The fix is quite simple. But if you don't want to fix that just yet, then you'll just have to avoid testing your branch on ZIMs that produce these errors, and stick to Wikipedia ZIMs for now. |
I used the Is there a ZIM file that contains redirected content, but all the data is in English? I'd like to check if Unicode characters are causing the problem. UPDATE: I tried to send the URL-encoded string as well but that didn't work too |
If you look at the OpenZIM specification, you'll see that UTF8 is the standard and mandatory encoding for all ZIM content. Libzim support UTF8. Without debugging your code in detail, I would hazard these guesses:
I strongly suggest you take a look at, and test, your ZIMs in the libzim experiments branch, where the only errors I noticed were the ones mentioned above (Uncaught ....), and we know what the easy solution to that is. If it's working there, and not working in your branch, then the problem is with your branch... |
187a0b0
to
579476b
Compare
I finally have a basic working version where most of the zims are worker Gutenberg, StackExchange, and some of the Wikipedia zims. After checking why those zims are not working it seems like they are not hitting the Zims which are not working
When I am done with these zims I can finally add the UI controls |
That's good news! Regarding the issues you are seeing, the first one is easy to explain -- see openzim/javascript-libzim#16. Feel free to look into why we're not able to load split ZIM files: I confess I haven't taken the time to look into it, because it's not critical and we have another way to load them in our legacy backend. It could be a small issue or something more fundamental). Regarding the older wikimed, I guess that's using XZ compression rather than zstandard, but in principle it shouldn't matter as libzim should support both. I would say neither of these are critical. Feel free to look into them if you wish, or else to advance your PR without solving those issues, and for now just leave a note saying that only recent unsplit ZIMs are currently supported, pending further investigation. |
In the meantime, I'll try to debug what's going on with |
@Rishabhg71 Thanks for changing that -- the UI is much clearer now, and you've also managed to declutter by hiding the dropdown till the user selects to use libzim. Could you change the tooltip that appears when hovering over the dropdown from "Libzim wasm mode" to "Uses the selected version of libzim to access the ZIM contents (ServiceWorker mode only)"? Also, change the label from "Libzim version: (Select which version to use)" to "Select libzim version to use:" (in bold). And you can remove the tooltip for the Use libzim checkbox (we don't need two similar tooltips). I still haven't had a chance to debug as I said above, but in the meantime, when you're satisfied with the UI, perhaps you could work on the translations? |
@Jaifroid The Requested changes are done. Could you check the translations if they are correct? It was auto-generated by Copilot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this PR, we're nearly there, but unfortunately you've fallen victim to the AI hallucination effect, as the translations are pretty much all hallucinated. I think only one was correct! Not your fault, but I'd always advise to check a translation by doing reverse translation back into English. A few other things noted.
i18n/es.jsonp.js
Outdated
"configure-expert-useLibzim-tip": "Si está usando el modo ServiceWorker, puede elegir entre usar la biblioteca Libzim o la biblioteca ZIMjs para cargar los archivos ZIM. La biblioteca Libzim es más rápida, pero la biblioteca ZIMjs es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la biblioteca Libzim, intente cambiar a la biblioteca ZIMjs.", | ||
"configure-expert-useLibzim": "Usar la biblioteca Libzim ", | ||
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.", | ||
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Laughable translation. Look at the length of it! It says that Version 5 is faster, but Version 6 is more robust... Where did that come from??!!! Just shows how bad AI can be...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but you haven't changed this laughable version! It still says "Version 5 is faster but Version 7 is more robust and can deal with larger ZIMs...." Etc. Pure hallucination. However, as per recommendation above, I'd just remove this.
i18n/fr.jsonp.js
Outdated
"configure-expert-useLibzim-tip": "Utiliser la bibliothèque ZIM pour lire les fichiers ZIM (nécessite un redémarrage).", | ||
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ", | ||
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim", | ||
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says "Mode of the ZIM library (default: 0)" !!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, you didn't address this point! Again, pure hallucination.
i18n/fr.jsonp.js
Outdated
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ", | ||
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim", | ||
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)", | ||
"configure-expert-libzimMode": "Mode de la bibliothèque ZIM", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, you marked this as resolved, but didn't change it.
www/js/init.js
Outdated
@@ -55,6 +55,8 @@ | |||
* @property {string} cacheIDB - Name of the Indexed DB database | |||
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported. | |||
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported. | |||
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files. | |||
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are listing permitted values of the property, but the actual property type is {String}
. A property's type and its permitted values are not the same thing in my books. But I may be wrong, and this could be a JSDoc convention. Can you point me to any JSDoc documentation that this is the correct way to list accepted values? I think a parallel might be listing a type as {Boolean}
as opposed to listing it as {True | False | Null | 1 | 0 | "" | Undefined}
, because those are all permitted Boolean values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, you marked some of my review comments from last time as resolved when they haven't changed at all, and still have hallucinated translations in them. Maybe you didn't expand the conversations in the conversations tab (GitHub tends to hide some if there are too many). In future, it's best to look at review comments directly in the Files changed tab, where they will always show.
EDIT: In Files Changed tab, you need to load the Diffs on the jsonp files to see the comments on the translations.
www/js/init.js
Outdated
@@ -55,6 +55,8 @@ | |||
* @property {string} cacheIDB - Name of the Indexed DB database | |||
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported. | |||
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported. | |||
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files. | |||
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, you've clearly researched this, thanks for the info. I learnt something.😉
i18n/es.jsonp.js
Outdated
"configure-expert-useLibzim-tip": "Si está usando el modo ServiceWorker, puede elegir entre usar la biblioteca Libzim o la biblioteca ZIMjs para cargar los archivos ZIM. La biblioteca Libzim es más rápida, pero la biblioteca ZIMjs es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la biblioteca Libzim, intente cambiar a la biblioteca ZIMjs.", | ||
"configure-expert-useLibzim": "Usar la biblioteca Libzim ", | ||
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.", | ||
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but you haven't changed this laughable version! It still says "Version 5 is faster but Version 7 is more robust and can deal with larger ZIMs...." Etc. Pure hallucination. However, as per recommendation above, I'd just remove this.
i18n/es.jsonp.js
Outdated
"configure-expert-useLibzim": "Usar la biblioteca Libzim ", | ||
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.", | ||
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.", | ||
"configure-expert-libzimMode": "Versión de la biblioteca Libzim", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Translation is still wrong. You need "Seleccionar versión libzim para usar:" (be sure to add colon at end)
i18n/fr.jsonp.js
Outdated
"configure-expert-useLibzim-tip": "Utiliser la bibliothèque ZIM pour lire les fichiers ZIM (nécessite un redémarrage).", | ||
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ", | ||
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim", | ||
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, you didn't address this point! Again, pure hallucination.
i18n/fr.jsonp.js
Outdated
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ", | ||
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim", | ||
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)", | ||
"configure-expert-libzimMode": "Mode de la bibliothèque ZIM", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, you marked this as resolved, but didn't change it.
3487b09
to
633e27b
Compare
Could you take another look it weird I did the translations via deepl maybe I messed up something |
The problem was the old translations that got missed probably due to the GitHub comment-hiding "feature", not so much the new ones. I'm looking now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a great improvement! There are some last minor but niggly little things -- please carefully go through the comments below, and when you're happy you've addressed these, feel free to squash and merge, I don't need to check again. One you missed yesterday/today, was converting the double-quotes to single-quotes in your type declaration to match the house style, so don't forget to include it in your last revisions. If you see some comments are hidden by GitHub in the list below, please expand them and/or go to the Files Changed tab and view the Diffs on all files.
www/js/init.js
Outdated
@@ -55,6 +55,8 @@ | |||
* @property {string} cacheIDB - Name of the Indexed DB database | |||
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported. | |||
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported. | |||
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files. | |||
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You haven't changed double-quotes to single as mentioned above. It looks incongruous otherwise.
Overall, many thanks for this PR -- it's a nice final one for your internship. I'm very pleased with your overall achievements in this repo: your PRs make a real and tangible difference to the Kiwix JS app! Well done! |
0eb05b3
to
ba17691
Compare
No description provided.