Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use libzim for reading ZIM contents #1160

Merged
merged 13 commits into from
Jan 15, 2024
Merged

Use libzim for reading ZIM contents #1160

merged 13 commits into from
Jan 15, 2024

Conversation

Rishabhg71
Copy link
Collaborator

No description provided.

@Rishabhg71 Rishabhg71 self-assigned this Nov 13, 2023
@Rishabhg71
Copy link
Collaborator Author

Rishabhg71 commented Nov 13, 2023

@Jaifroid Basic UI is done. What I wanted to know is whether there is any implementation for libzim or do I start from scratch (except for other PR). Also what functions should I implement first?

@Jaifroid
Copy link
Member

Jaifroid commented Nov 13, 2023

For now we can only implement the functions that are currently exposed in the libzim bindings over at javascript/libzim, and for all intents and purposes, I believe those are full-text search (already implemented in our production code, so there is nothing to do to that at this stage), retrieving an article by path, and retrieving assets by path or URL. All this is enough to be able to read and browse the contents of a ZIM.

It would be a good idea to allow developers to load (from the interface), any one of the four implementations of libzim that we have, as each will be helpful in analysing and decoding issues:

  1. The default would be off (use existing implementation, or disable libzim for article loading)
  2. Use the production WASM (i.e., minified and production-ready version) for article loading;
  3. Use the dev version of the WASM implementation (this is built with error assertions, so we get meaningful feedback from errors occurring inside the WASM);
  4. Use the production ASM version for older browsers;
  5. Use the dev ASM version (useful because it is pure JavaScript, so can be debugged with DEV tools).

You can see an example of the kind of choice I'm suggesting over at the PWA (https://pwa.kiwix.org), where I included such a dropdown under Troubleshooting and Development in Configuration. However, that dropdown only chooses the implementation for full-text searching, it doesn't allow the WASM to be loaded for article retrieval. Your implementation should not be about search, but should instead be about using the WASM for reading articles.

Needless to say, you mustn't load the WASM twice, in other words, your dropdown should control the EXISTING loading of the WASM in zimArchive.js. It is that version that you will then call like in the lbizm experiments PR #766. If the user disables WASM use in your dropdown, full-text searching should still work (in other words, Disable would only turn off reading articles with libzim WASM, not search).

You can start with Service Worker mode (because that's easiest), but there is no reason why libzim can't be used in jQuery mode too, and once you've understood the logic in SW mode it should be easy to use the same calls in jQuery mode.

I'm not sure I've expressed myself clearly. Let me know if I can clarify anything.

@Jaifroid
Copy link
Member

By the way, I think the title of this PR isn't very clear. We already use WASM (and ASM) for XZ and ZSTD. This PR is about libzim, not whether it is ASM or WASM. I suggest changing the title to something like "Use libzim for reading ZIM contents" (or something similar)!

@Jaifroid Jaifroid added enhancement dependencies Pull requests that update a dependency file backend labels Nov 13, 2023
@Jaifroid Jaifroid added this to the v4.0 milestone Nov 13, 2023
@Rishabhg71 Rishabhg71 changed the title WASM mode for kiwix-js Use libzim for reading ZIM contents Nov 14, 2023
@Rishabhg71
Copy link
Collaborator Author

Rishabhg71 commented Nov 15, 2023

@Jaifroid I was working on getEntryByPath from libzim_binding but I don't what function does what. There are no docs here as well, is there any place I could directly look for comments on these functions? I do have a vague idea of what args and functionality https://github.com/openzim/javascript-libzim/blob/main/libzim_bindings.cpp#L84C2-L84C2

@Jaifroid
Copy link
Member

@Rishabhg71 You have to look at the libzim source itself here: https://github.com/openzim/libzim/tree/main/src . Basically, the bindings try to expose in JS the functions that we need from libzim.

As well as the source code, there is this:

https://libzim.readthedocs.io/en/stable/

Hopefully some of that helps!

However, remember we've already got the most useful bindings exposed, so you don't need to replicate those. It's great if you can get your head round them, though, as at least one of them doesn't seem to have proper error catching.

@Rishabhg71
Copy link
Collaborator Author

@Jaifroid I am still confused as I was before. I will try explaining with an example like there is a function called getEntryByPath now I can call this easily but I don't know which args to pass. Looking into the docs I get the vague idea that I need to pass some sort of path but which path?? (I tried /| index.html | this.file.mainPage). I can't go on and create a proper mapping if I don't know which args to pass. Do you have any idea how do I proceed?

@Jaifroid
Copy link
Member

The PWA is an easy way to see all the URL paths in a ZIM. Open a ZIM, and in the search bar, type space + / (space followed by forward slash). That will evoke the URL index, and then you can clearly see what kind of ZIM you're dealing with, a Type 0 (articles in A/ namespace), a Type 1 (all user content in C/ namespace) or a hybrid Zimit type (with namespaces that look like C/A(... for user content, and C/H/... for response headers).

If you are interested in a particular article, you can type part of its path into the search box, like A/Far-. So long as you begin with a valid namespace, it will search the URL index (see image below).

As far as I know, the "path" is in fact the ZIM URL as shown in the image below. Try it and see. Note that in Wikimedia ZIMs, it's still A/..., whereas in most other newer ZIMs it should be C/...

image

@Jaifroid
Copy link
Member

PS In both the Kiwix JS code (current code on main) and the Kiwix PWA code, you can find in Dev Tools a printout of the ZIM File properties, as well of the ZIM Archive properties after you open a ZIM. There it also tells you what ZIM type you're dealing with.

image

image

@Jaifroid
Copy link
Member

And did you try out the libzim test implementation linked from here: https://github.com/openzim/javascript-libzim? That's a really basic implementation that allows you to see clearly how the calls are done to the WASM version (by examining in DEV tools). It's kind of tuned for using a Ray Charles (modern) ZIM, but you can use any so long as you alter the paths. Beware that some ZIMs currently don't work for article loading due to the lack of error-handling I've mentioned before (one thing we need to fix, see openzim/javascript-libzim#64).

@Rishabhg71
Copy link
Collaborator Author

@Jaifroid It seems like after the latest update WASM files are not exporting the functions at all. Reverting back to this commit is working wasm is working 53bcf308389dd5f3827b2007c46c1a318c39874a
image

@Jaifroid
Copy link
Member

OK, I'll check. I updated this branch from main, but didn't test properly, as it's an experimental branch.

@Jaifroid
Copy link
Member

Oh, that's simply because this branch needs to be fixed as I outlined in #1169. As I wrote to you there, you need to call whenZimReady only after the Promise for the libzim has completed. Currently it's being called too early. That's all. You should keep the current call to whenZimReady, but make it conditional on whether the user has selected to use libzim for file reading. As I suggested, it would be good to set a params.useLibzim when the user selects that option, so you can check for it here and call whenZimReady at the appropriate point.

@Rishabhg71
Copy link
Collaborator Author

Thanks to your advice I have a working prototype that I can now build upon. The first issue I have encountered here is only certain Zim files are being loaded properly as you can see in the video below. It seems like something related to namespace working Gutenberg has a namespace of A/Home.html and another one has W/mainPage. Its a guess but could you point me in right direction?

20231122-1918-05.5203601.mp4

@Jaifroid
Copy link
Member

Good news that it's (partially) working!

I don't think that's an issue with the ZIMs. All ZIMs have a W namespace entry, standing for "Well-known entry", aka, pointer to the landing page. That pointer is read when the ZIM is first opened, to determine the location of the landing page directory entry. That should redirect (automatically) to something like index.html (but it won't always be that).

You shouldn't have to handle any of this. It should all be transparent. It's puzzling indeed that it works for one case, but not for another. Could it be that in one case, our backend is resolving W/mainPage, whereas in the other, you're passing that request directly to libzim? Our backend will resolve W/mainPage to C/Home.html in the case of the ro ZIM, but resolving that is the job of a function that may not be available in our bindings, something like getMainPageDirEntry() (that's what we call it).

You could pause on that function in zimArchive.js to see if it's being called in some cases, but not in others, and then work back from there.

@Jaifroid
Copy link
Member

The issue could be much simpler, actually. I just checked the gutenberg ro archive in the Libzim experiments PR, and it's one of the ones that has the reading bug I mentioned in openzim/javascript-libzim#64. If you see these errors in your console, then you're hitting that bug, which needs to be fixed in the way described in that issue.

image

The fix is quite simple. But if you don't want to fix that just yet, then you'll just have to avoid testing your branch on ZIMs that produce these errors, and stick to Wikipedia ZIMs for now.

@Rishabhg71
Copy link
Collaborator Author

Rishabhg71 commented Nov 24, 2023

image

I used the ZIMArchive.resolveRedirect method to obtain the redirected URL, but it seems that the backend (libzim) is unable to retrieve the content from the redirected URL. This issue might be related to the presence of Unicode characters in the URL. Additionally, I tested a simple file without redirects, wikipedia_en_100_maxi_2023-11.zim, and found that this ZIM file loaded perfectly, with all pages functioning properly.

Is there a ZIM file that contains redirected content, but all the data is in English? I'd like to check if Unicode characters are causing the problem.

UPDATE: I tried to send the URL-encoded string as well but that didn't work too

@Jaifroid
Copy link
Member

If you look at the OpenZIM specification, you'll see that UTF8 is the standard and mandatory encoding for all ZIM content. Libzim support UTF8.

Without debugging your code in detail, I would hazard these guesses:

  • You're running into Builds from 0.0.2 onwards (probably also 0.0.1) appear to stall if a dirEntry is not found openzim/javascript-libzim#64, and need to compile new versions of the libzim ASM and WASM. You can tell if this is the case by testing in a Chromium browser, where the error circled in the screenshot above is clearly shown. If you make a PR over there with just the exception catching mentioned (you only need to insert it in two places in the makefile), then the PR will automatically build new versions for you and attach them to the workflow output. You can then test with those.
  • Alternatively, you're not doing decodeURI, or decodeURIComponent (better) on the URL before sending it to libzim.

I strongly suggest you take a look at, and test, your ZIMs in the libzim experiments branch, where the only errors I noticed were the ones mentioned above (Uncaught ....), and we know what the easy solution to that is. If it's working there, and not working in your branch, then the problem is with your branch...

@Rishabhg71
Copy link
Collaborator Author

I finally have a basic working version where most of the zims are worker Gutenberg, StackExchange, and some of the Wikipedia zims. After checking why those zims are not working it seems like they are not hitting the handleMessageChannelMessage function. Could you maybe give me some advice

Zims which are not working

  • wikipedia_en_ray_charles_2015-06.zimaa
  • wikiwel_en_all_maxi_2018-10.zim

When I am done with these zims I can finally add the UI controls

@Jaifroid
Copy link
Member

That's good news!

Regarding the issues you are seeing, the first one is easy to explain -- see openzim/javascript-libzim#16. Feel free to look into why we're not able to load split ZIM files: I confess I haven't taken the time to look into it, because it's not critical and we have another way to load them in our legacy backend. It could be a small issue or something more fundamental).

Regarding the older wikimed, I guess that's using XZ compression rather than zstandard, but in principle it shouldn't matter as libzim should support both.

I would say neither of these are critical. Feel free to look into them if you wish, or else to advance your PR without solving those issues, and for now just leave a note saying that only recent unsplit ZIMs are currently supported, pending further investigation.

@Jaifroid
Copy link
Member

In the meantime, I'll try to debug what's going on with W/mainPage.

@Jaifroid
Copy link
Member

@Rishabhg71 Thanks for changing that -- the UI is much clearer now, and you've also managed to declutter by hiding the dropdown till the user selects to use libzim. Could you change the tooltip that appears when hovering over the dropdown from "Libzim wasm mode" to "Uses the selected version of libzim to access the ZIM contents (ServiceWorker mode only)"? Also, change the label from "Libzim version: (Select which version to use)" to "Select libzim version to use:" (in bold). And you can remove the tooltip for the Use libzim checkbox (we don't need two similar tooltips).

I still haven't had a chance to debug as I said above, but in the meantime, when you're satisfied with the UI, perhaps you could work on the translations?

@Rishabhg71
Copy link
Collaborator Author

@Jaifroid The Requested changes are done. Could you check the translations if they are correct? It was auto-generated by Copilot

Copy link
Member

@Jaifroid Jaifroid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR, we're nearly there, but unfortunately you've fallen victim to the AI hallucination effect, as the translations are pretty much all hallucinated. I think only one was correct! Not your fault, but I'd always advise to check a translation by doing reverse translation back into English. A few other things noted.

www/index.html Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated
"configure-expert-useLibzim-tip": "Si está usando el modo ServiceWorker, puede elegir entre usar la biblioteca Libzim o la biblioteca ZIMjs para cargar los archivos ZIM. La biblioteca Libzim es más rápida, pero la biblioteca ZIMjs es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la biblioteca Libzim, intente cambiar a la biblioteca ZIMjs.",
"configure-expert-useLibzim": "Usar la biblioteca Libzim ",
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.",
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Laughable translation. Look at the length of it! It says that Version 5 is faster, but Version 6 is more robust... Where did that come from??!!! Just shows how bad AI can be...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but you haven't changed this laughable version! It still says "Version 5 is faster but Version 7 is more robust and can deal with larger ZIMs...." Etc. Pure hallucination. However, as per recommendation above, I'd just remove this.

i18n/fr.jsonp.js Outdated
"configure-expert-useLibzim-tip": "Utiliser la bibliothèque ZIM pour lire les fichiers ZIM (nécessite un redémarrage).",
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ",
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim",
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says "Mode of the ZIM library (default: 0)" !!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, you didn't address this point! Again, pure hallucination.

i18n/fr.jsonp.js Outdated
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ",
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim",
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)",
"configure-expert-libzimMode": "Mode de la bibliothèque ZIM",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, you marked this as resolved, but didn't change it.

www/index.html Outdated Show resolved Hide resolved
www/js/app.js Outdated Show resolved Hide resolved
www/js/init.js Outdated
@@ -55,6 +55,8 @@
* @property {string} cacheIDB - Name of the Indexed DB database
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported.
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported.
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files.
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected.
Copy link
Member

@Jaifroid Jaifroid Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are listing permitted values of the property, but the actual property type is {String}. A property's type and its permitted values are not the same thing in my books. But I may be wrong, and this could be a JSDoc convention. Can you point me to any JSDoc documentation that this is the correct way to list accepted values? I think a parallel might be listing a type as {Boolean} as opposed to listing it as {True | False | Null | 1 | 0 | "" | Undefined}, because those are all permitted Boolean values.

Copy link
Member

@Jaifroid Jaifroid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, you marked some of my review comments from last time as resolved when they haven't changed at all, and still have hallucinated translations in them. Maybe you didn't expand the conversations in the conversations tab (GitHub tends to hide some if there are too many). In future, it's best to look at review comments directly in the Files changed tab, where they will always show.

EDIT: In Files Changed tab, you need to load the Diffs on the jsonp files to see the comments on the translations.

www/js/init.js Outdated
@@ -55,6 +55,8 @@
* @property {string} cacheIDB - Name of the Indexed DB database
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported.
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported.
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files.
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, you've clearly researched this, thanks for the info. I learnt something.😉

i18n/en.jsonp.js Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated
"configure-expert-useLibzim-tip": "Si está usando el modo ServiceWorker, puede elegir entre usar la biblioteca Libzim o la biblioteca ZIMjs para cargar los archivos ZIM. La biblioteca Libzim es más rápida, pero la biblioteca ZIMjs es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la biblioteca Libzim, intente cambiar a la biblioteca ZIMjs.",
"configure-expert-useLibzim": "Usar la biblioteca Libzim ",
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.",
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but you haven't changed this laughable version! It still says "Version 5 is faster but Version 7 is more robust and can deal with larger ZIMs...." Etc. Pure hallucination. However, as per recommendation above, I'd just remove this.

i18n/es.jsonp.js Outdated
"configure-expert-useLibzim": "Usar la biblioteca Libzim ",
"configure-expert-useLibzim-warning": "¡ADVERTENCIA! Esta opción sólo se aplica al modo ServiceWorker.",
"configure-expert-libzimMode-tip": "Seleccione la versión de la biblioteca Libzim que desea usar. La versión 5 es más rápida, pero la versión 6 es más robusta y puede manejar archivos ZIM más grandes. Si tiene problemas con la versión 5, intente cambiar a la versión 6.",
"configure-expert-libzimMode": "Versión de la biblioteca Libzim",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translation is still wrong. You need "Seleccionar versión libzim para usar:" (be sure to add colon at end)

i18n/fr.jsonp.js Outdated
"configure-expert-useLibzim-tip": "Utiliser la bibliothèque ZIM pour lire les fichiers ZIM (nécessite un redémarrage).",
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ",
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim",
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, you didn't address this point! Again, pure hallucination.

i18n/fr.jsonp.js Outdated
"configure-expert-useLibzim": "Utiliser la bibliothèque ZIM ",
"configure-expert-useLibzim-warning": "Utilise l'API libzim instable pour lire zim",
"configure-expert-libzimMode-tip": "Mode de la bibliothèque ZIM (défaut : 0)",
"configure-expert-libzimMode": "Mode de la bibliothèque ZIM",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, you marked this as resolved, but didn't change it.

@Rishabhg71
Copy link
Collaborator Author

Could you take another look it weird I did the translations via deepl maybe I messed up something

@Jaifroid
Copy link
Member

Could you take another look it weird I did the translations via deepl maybe I messed up something

The problem was the old translations that got missed probably due to the GitHub comment-hiding "feature", not so much the new ones. I'm looking now.

Copy link
Member

@Jaifroid Jaifroid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a great improvement! There are some last minor but niggly little things -- please carefully go through the comments below, and when you're happy you've addressed these, feel free to squash and merge, I don't need to check again. One you missed yesterday/today, was converting the double-quotes to single-quotes in your type declaration to match the house style, so don't forget to include it in your last revisions. If you see some comments are hidden by GitHub in the list below, please expand them and/or go to the Files Changed tab and view the Diffs on all files.

i18n/fr.jsonp.js Outdated Show resolved Hide resolved
i18n/fr.jsonp.js Show resolved Hide resolved
i18n/es.jsonp.js Outdated Show resolved Hide resolved
i18n/en.jsonp.js Outdated Show resolved Hide resolved
i18n/es.jsonp.js Outdated Show resolved Hide resolved
i18n/fr.jsonp.js Outdated Show resolved Hide resolved
www/index.html Outdated Show resolved Hide resolved
www/js/init.js Outdated
@@ -55,6 +55,8 @@
* @property {string} cacheIDB - Name of the Indexed DB database
* @property {boolean} isFileSystemApiSupported - A boolean indicating whether the FileSystem API is supported.
* @property {boolean} isWebkitDirApiSupported - A boolean indicating whether the Webkit Directory API is supported.
* @property {boolean} useLibzim - A boolean indicating weather to use the libzim to load zim files.
* @property {"wasm-dev" | "wasm" | "asm" | "asm-dev" | "default"} libzimMode - A value indicating which libzim mode is selected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You haven't changed double-quotes to single as mentioned above. It looks incongruous otherwise.

www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member

Overall, many thanks for this PR -- it's a nice final one for your internship. I'm very pleased with your overall achievements in this repo: your PRs make a real and tangible difference to the Kiwix JS app! Well done!

@Jaifroid Jaifroid mentioned this pull request Jan 12, 2024
39 tasks
@Rishabhg71 Rishabhg71 merged commit 11138ff into main Jan 15, 2024
9 checks passed
@Rishabhg71 Rishabhg71 deleted the libzim-ui-option branch January 15, 2024 17:03
@Jaifroid Jaifroid mentioned this pull request Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend dependencies Pull requests that update a dependency file enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants