Internationalization of page titles #140
Question: will changing the slug format break searching? I was concerned that it would, but it doesn't seem that it will. Searching uses JavaScript's substring matching against the sitemap. Current behavior:
The next question is compatibility. I have found a couple of important things:
This suggests to me that it might be good to decrease the use of the slug elsewhere and use the narrower representation only in the persistence layer. Then, if we have the full version in most places, we can derive the old encoding to handle that case as needed. Notably, page links are stored on disk as typed.
This is also what goes over the wire in the server's response when pages are requested. The uses of the slug are fewer than I thought. Pages are stored on disk under a filename corresponding to their slug, which of course is currently the old slug that loses non-Latin-character information.

Exploring a possible solution: as a first step, the server learns to respond to Unicode "slugs". One way to tackle this could be to speak Unicode instead of slugs over the wire, while maintaining a similar lossy behavior when slugs are spoken over the wire. It might go like this:

- The upgraded server renames all page files to some non-lossy encoding of the Unicode page title (Punycode, the actual Unicode name, URL encoding) and checks this at startup.
- The upgraded server maintains a list of old slugs for all its pages, computed from the Unicode page titles.
- When receiving a request that matches an old slug, the server returns the page corresponding to it. Its behavior when multiple pages match is to return one of them, undefined, much as wiki's current undefined behavior when multiple Unicode page names resolve to the same slug. So the upgraded server behaves just like current wiki when sent a slug from current wiki.
- When receiving a request in upgraded wiki's new non-lossy format, the server doesn't interpret it; it only maps it to whatever non-lossy filename format is used and returns the corresponding page (or 404 if it doesn't exist).

My remaining questions are related to other sites, both requesting their pages and interpreting their …
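The steps above can be sketched in plain JavaScript. This is a hypothetical illustration, not wiki's actual code: the names `pages`, `byOldSlug`, and `lookup` are invented, and `asSlug` is written to match the lossy behavior described here (strip non-Latin characters, lowercase).

```javascript
// Current wiki's lossy slug, as described in this issue: spaces become
// dashes, everything outside [A-Za-z0-9-] is stripped, then lowercase.
const asSlug = (name) =>
  name.replace(/\s/g, '-').replace(/[^A-Za-z0-9-]/g, '').toLowerCase();

// Hypothetical page store: Unicode page title -> page data.
const pages = new Map([
  ['Гильдии', { title: 'Гильдии' }],
  ['Guilds',  { title: 'Guilds' }],
]);

// Old-slug index computed from the Unicode titles. When several titles
// collapse to one old slug (e.g. 'Гильдии' collapses to ''), one of
// them wins -- undefined behavior, mirroring current wiki.
const byOldSlug = new Map();
for (const [title, page] of pages) byOldSlug.set(asSlug(title), page);

function lookup(requested) {
  // Non-lossy request: match the Unicode title directly.
  if (pages.has(requested)) return pages.get(requested);
  // Legacy request: fall back to the old-slug index, or 404.
  return byOldSlug.get(requested) ?? null;
}
```

A request for the old slug `guilds` and a request for the Unicode title `Гильдии` both resolve; an unknown name falls through to a 404.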
I have finished my pass through all the variables I saw discussed, through using wiki, and through the code. I think I have found a solution that will work well for humans, machines, and this project's needs. So here is my proposal for what to do. I would appreciate the team's feedback, especially @almereyda, @WardCunningham, and @paul90, who have been kindly discussing i18n with me in #139 and fedwiki/wiki-client#103. I am ready to make a PR taking a stab at this if the team feels they'd be willing to merge a good implementation of it. Thank you for your encouragement and willingness to look at this issue.

tl;dr: Unicode slugs with backward compatibility. The biggest impact outside the code base is renaming the wiki's page files.

Summary of the proposal
New slug format ("Unislug")

Non-lossy slugs, called Unislugs here, are generated from the Unicode page title as follows:
The result after this preprocessing will continue to have non-Latin characters. The Unislug format has the property that existing OldSlugs can be derived from Unislugs, which creates the opportunity for thorough backward compatibility.

Code changes

What changes will be made to the upgraded server code?

- The upgraded server renames all page files to some non-lossy encoding of the Unislug (this might be the Punycoded Unislug, the Unislug itself, or the URL-encoded Unislug). This can be a one-time process that writes an upgraded-marker file somewhere when it has been done. The server checks for this at startup, and if it hasn't been done it either does it or exits. Note that there may be many wikis that do not have any pages with non-Latin characters in page titles, and those may not need any changes, so we should certainly not bother upgrading users in that case.
- The upgraded server maintains a list of OldSlugs for all its pages, computed from the Unislugs.
- When receiving a request that matches an OldSlug on the list, the server returns the page corresponding to it. Its behavior when multiple pages match is to return one of them, undefined, much as wiki's current undefined behavior when multiple Unicode page names resolve to the same slug. So the upgraded server behaves just like current wiki when current wiki requests an OldSlug. Currently wiki writes a file for the first page to match a particular slug, but now it will be possible for wiki to have multiple pages matching a particular slug. First of all, this feels like undefined behavior that can't reasonably be considered part of a contract wiki makes. But even so, serving the matching page whose file has the earliest created timestamp will very closely approximate wiki's current behavior.
- When receiving a request for upgraded wiki's Unislug, the server doesn't interpret it; it only maps it to whatever non-lossy filename format is used and returns the corresponding page (or 404 if it doesn't exist).
What changes will be made to the upgraded client code?

Links to wiki page names are URL-encoded in the HTML anchor tags, so that they display as the Unislugs, including any non-Latin characters, in the URL bar, just as happens on Wikipedia.

Sitemaps
I dug in further and was glad to find that cooking down Unicode can be way simpler and much more like what wiki does today (see https://javascript.info/regexp-character-sets-and-ranges):

```coffeescript
asUnislug = (name) ->
  name.replace(/\s/g, '-').replace(
    /[^\p{Alphabetic}\p{Mark}0-9-]/gu, ''
  ).toLowerCase()
```

I hadn't looked into it before now, but it looks like these Unicode regexes are usable by 95% of browsers (see the CanIUse browser support statistics). Where they are unsupported, this throws a `SyntaxError`, so we can fall back to the old slug:

```javascript
try {
  asUnislug(name);
} catch (e) {
  if (e instanceof SyntaxError) {
    asSlug(name);
  } else {
    throw e;
  }
}
```
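For illustration, here is a plain-JavaScript rendering of the same idea. It is a sketch under the assumptions of this thread: `asUnislug` and `asSlug` are the names used here, and the fallback wiring (building the pattern with `new RegExp` so the `SyntaxError` can be caught at setup time) is one possible arrangement, not the implementation.

```javascript
// Current wiki's lossy slug, per this issue's description.
const asSlug = (name) =>
  name.replace(/\s/g, '-').replace(/[^A-Za-z0-9-]/g, '').toLowerCase();

let asUnislug;
try {
  // Unicode property escapes throw SyntaxError where unsupported, so
  // build the pattern dynamically and catch that at setup time.
  const pattern = new RegExp('[^\\p{Alphabetic}\\p{Mark}0-9-]', 'gu');
  asUnislug = (name) =>
    name.replace(/\s/g, '-').replace(pattern, '').toLowerCase();
} catch (e) {
  if (e instanceof SyntaxError) {
    asUnislug = asSlug; // degrade to current lossy behavior
  } else {
    throw e;
  }
}

console.log(asUnislug('Гильдии'));     // 'гильдии' -- letters survive
console.log(asUnislug('Guilds Page')); // 'guilds-page' -- unchanged for Latin titles
```

On a Latin-only title the Unislug equals the OldSlug, which is what makes OldSlugs derivable from Unislugs.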
We do appreciate your effort here. Thanks.
OldSlug vs. Unislug comparison
Resolving conflicts
I am unaware of anything that would break in this process. The collisions discussed are already present in current wiki. We continue to strictly limit the character set usable in slugs.
Got some feedback. Next actions:
Best @replaid, many thanks for paving the way here. Looking forward to reading http://john.permakultura.wiki/guilds.html in many other languages too, once we finish up with #139. Minor observations made me slow down during reading. They were:
With the beautiful examples that you collected, and with the anticipated changes that you stepped through for us, I am very confident that we can grow our local(-ised 🌐) wiki communities.
As a wiki author working in a language with a non-Latin script, I want to be able to link to a page like `[[Гильдии]]` (the equivalent of `[[Guilds]]` in Russian). Currently wiki strips out all the non-Latin characters from page titles, so `Гильдии` converts to a slug that is the empty string.

In the specific case of the Russian language, the alphabet is very phonetic, so many Russian websites have software to solve this problem by mapping the Cyrillic letters to Latin letters or clusters of Latin letters, in this case `Gil'dii`. However, it seems likely that such a "Russian mode" would not be the best solution for wiki.

It looks to me like there is a fork in the road. I will call the two general paths I see "incompatible slug" and "compatible slug" as a best effort to describe this.
Incompatible slug
We could opt to present non-Latin page title characters directly in the URLs and let the backend map those to and from some kind of encoding for the filenames (or even filenames in a subset of UTF-8). This is the general approach Wikipedia takes. When someone links to `[[Гильдии]]`, Wikipedia URL-encodes that Unicode page title into the link, like `<a href="/wiki/%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8">`. Users see `/wiki/Гильдии` in the URL bar. I don't know what exactly happens on the back end.

I suspect this approach has a different texture than the current approach of Federated Wiki. I think it would represent a change in direction and would have repercussions that would upset the "just enough" ethos that characterizes this project.
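The percent-encoded href in the Wikipedia example is exactly what JavaScript's standard `encodeURIComponent` produces; the `/` path prefix below is just for illustration.

```javascript
// URL-encode a Unicode page title for use in an anchor href. Browsers
// decode it back to the readable form in the URL bar.
const title = 'Гильдии';
const href = '/' + encodeURIComponent(title);
console.log(href); // '/%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8'
```

So the encoding step itself needs no new machinery on the client side.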
Compatible slug
Alternatively, we could convert non-Latin letters to some slug that is compatible with the current backend. This is pretty much exactly how domain names with non-Latin letters are handled: they are mapped to Latin letters using Punycode (in which `Гильдии` is lowercased and encodes to `xn--c1aclbap3j`), and are then compatible with existing DNS. Slugs that are not based primarily on Latin characters are unreadable to humans, but they are unambiguously decodable to a lowercased version of the non-Latin input.

If we were to add Punycode encoding to the `asSlug` method, I think a lot of things would just work, especially based on how well wiki works when I create a page by typing `[[по-русски]]`: this discards all the letters and just creates the slug `-`, but this works fine as long as I don't make another such page.

I have not yet dug into the code for sitemaps and searches, but I would imagine that this code would need to become aware of Punycode-decoding slugs that begin with `xn--`.
But once that's in, I think it would be quite transparent. Walking through a search:

- A page is created with `[[Гильдии]]`, which produces the slug `xn--c1aclbap3j`.
- The sitemap lists `xn--c1aclbap3j`.
- The user types `гил` into the search bar.
- The search code finds `xn--c1aclbap3j` in the sitemap information and decodes that to `гильдии` for search matching purposes.
- `гил` matches as a substring of `гильдии`, just as `guil` matches as a substring of `guilds` for a page named `Guilds`.
- The user sees the result `xn--c1aclbap3j` and clicks it.
- The page shows `гильдии`, which is replaced with `Гильдии` when the page title loads.
when the page title loadsSo the only real user-visible wart is the presence of the Punycode in the link itself.
This could even be a stepping stone to eventually keeping the client experience in native languages in future steps.
I may be missing other uses of the slug that need to be accounted for.
This issue is an offshoot of conversations at fedwiki/wiki-client#103 and #139.