Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modification to syntax.js parser RegEx to provide better Unicode support #4390

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

AndyKilmory
Copy link
Collaborator

@AndyKilmory AndyKilmory commented Jan 8, 2025

What does this change?

Currently if a user inputs unicode characters into the general search string (e.g Möbius) then after the user has viewed a matching image and returned to the grid view via 'Back to Search' (or performed a page refresh) the unicode characters are lost from the search bar content...

Screenshot 2025-01-08 at 10 23 28

This PR corrects this problem by modifying the parser RegEx that is used to analyse the query content to create the structured query including chips. After implementation the resulting search bar content retains the unicode characters...

Screenshot 2025-01-08 at 10 24 50

The characters are also retained within chips within the search bar.

How should a reviewer test this change?

Ensure that any unicode characters entered into the search bar are retained on page refresh.

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@AndyKilmory AndyKilmory requested review from a team as code owners January 8, 2025 10:32
@paperboyo
Copy link
Contributor

This regex is pure evil: quotes disappear, chips to arbitrary metadata are broken, ligatures…

But thank you, it fixes diacritics at least. So (almost) fixes #1900.
🙏

@andrew-nowak
Copy link
Member

I've had a good rack of my brains at what the problem we encountered was with this when we attempted previously, my best guess is that we also tried to fix the problem of searching with a chip with a quoted field name (@paperboyo 's example for this was "fileMetadata.icc.Profile Description":"sRGB IEC61966-2.1" - trying to use this on PROD loses the chip, and the search bar falls out of sync with the url), and we couldn't get both to play nicely together, so gave up on both, which was a shame. This seems to work fine in isolation, at least in all the examples I could think of I couldn't find a combination which failed.

This inspired me to have another go again at getting the regex to play nicely with quoted field names, and building on top of your changes, I got to 964a6f0 which currently seems to cover all the cases I can think of.

@paperboyo is back on Monday, and I'll ask him to give it all one more look, but I can't see any reason to object to this. That said, it looks like a couple of unrelated changes to the start scrips have crept into this branch, can we drop those commits? Then we should be able to get this merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants