Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-ASCII characters in ISO-8859-x charsets #3310

Merged
merged 35 commits into from
Jul 23, 2024
Merged

Conversation

seisman
Copy link
Member

@seisman seisman commented Jul 3, 2024

This PR enhances PyGMT to support non-ASCII characters in ISO-8859-x charsets (Refer to https://en.wikipedia.org/wiki/ISO/IEC_8859-1, https://en.wikipedia.org/wiki/ISO/IEC_8859-2 and more for available characters in each encoding). Related to #2204.

In GMT, the default encoding is ISOLatin1+ (or Standard+ but it's not supported in PyGMT yet). To use non-ASCII characters in ISO-8859-x, we need to use --PS_CHAR_ENCODING=<encoding> according to https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING. Please note that we can only use one encoding in a single GMT module call, which means mixing characters from different encodings will cause trouble. So, the implementation is simple. We just need to check the argument string or Figure.text's text string and determine the charset encoding to use, and then add --PS_CHAR_ENCODING=<encoding> when calling modules.

This PR does a few things:

  1. Add the mapping dictionaries of all ISO-8859-x charsets to pygmt/encodings.py
  2. Add check_encoding to check the encoding of a string.
  3. Add the new encoding parameter to non_ascii_to_otcal so that we know the encoding to use when converting non-ascii.
  4. Update to make PyGMT arguments and Figure.text's text parameter support ISO-8859-x charsets.

Preview : https://pygmt-dev--3310.org.readthedocs.build/en/3310/techref/encodings.html


Here is an example showing how it works with non-ASCII characters in ISO-8859-4. The same script is added as a test.

import pygmt

fig = pygmt.Figure()
fig.basemap(region=[0, 10, 0, 10], projection="X10c", frame=["WSEN+tAāáâãäåB"])
fig.text(position="TL", text="position-text:1ÉĘËĖ2")
fig.text(x=1, y=1, text="xytext:1éęëė2")
fig.text(x=[5, 5], y=[3, 5], text=["xytext1:ųúûüũūαζ∆❡", "xytext2:íîī∑π∇✉"])
fig.show()

The output is
image


The script below shows all the characters in different ISO-8859-x encodings. I don't check them all carefully, but most look correct. The exceptions are ISO-8859-6, ISO-8859-8, and ISO-8859-11. They don't work with standard fonts (https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING) but I think it's OK because whoever wants to use these characters should have the fonts installed. Some characters like in ISO-8859-14 are missing in the figure below and they also can't be shown correctly in GMT CLI. Looking at the charset definition files in the GMT source codes, the character has a PS name like /uni1E40 (https://github.com/GenericMappingTools/gmt/blob/0584d620d27de5ed93a90e4b0dc56a2edb2d568a/src/PSL_ISO-8859-14.h#L24C43-L24C43), but it doesn't seem like a valid PS name. So, it must be upstream issues if any.

import pygmt
from pygmt.encodings import charset

fig = pygmt.Figure()
with fig.subplot(ncols=5, nrows=3, subsize=("10c", "15c"), margins=0.5):

    for i in range(1, 17):
        if i == 12:
            continue
        encoding = f"ISO-8859-{i}"
        chars = "".join(charset[encoding].values())
        fig.basemap(region=[0, 16, 0, 13], projection="X?/-?", frame=[0, f"+t{encoding}"], panel=True)
        for j in range(0, len(chars), 16):
            fig.text(x=1, y=int(j/16) + 1, text=chars[j:j+16], font="20p", justify="ML")
fig.savefig("encodings.png")

encodings

@seisman seisman changed the title Support non-ASCII characters in ISO-8859-x charsets WIP: Support non-ASCII characters in ISO-8859-x charsets Jul 3, 2024
@seisman seisman added the feature Brand new feature label Jul 3, 2024
@seisman seisman added this to the 0.13.0 milestone Jul 3, 2024
Copy link
Contributor

github-actions bot commented Jul 4, 2024

Summary of changed images

This is an auto-generated report of images that have changed on the DVC remote

Status Path
added pygmt/tests/baseline/test_text_nonascii_iso8859.png

Image diff(s)

Added images

  • test_text_nonascii_iso8859.png

Modified images

Path Old New

Report last updated at commit 3474bb2

@seisman seisman changed the title WIP: Support non-ASCII characters in ISO-8859-x charsets Support non-ASCII characters in ISO-8859-x charsets Jul 4, 2024
@seisman seisman added enhancement Improving an existing feature needs review This PR has higher priority and needs review. and removed feature Brand new feature labels Jul 4, 2024
@seisman seisman marked this pull request as ready for review July 4, 2024 08:29
pygmt/encodings.py Outdated Show resolved Hide resolved
pygmt/encodings.py Outdated Show resolved Hide resolved
pygmt/helpers/utils.py Outdated Show resolved Hide resolved
Co-authored-by: Yvonne Fröhlich <[email protected]>
return encoding
# Return the "ISOLatin1+" encoding if the string contains characters from multiple
# charset encodings or contains characters that are not in any charset encoding.
return "ISOLatin1+"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns ISOLatin1+/ISO-8859-1/.../ISO-8859-16, but Python's standard encodings are latin_1/iso8859_1/...iso8859_16 (https://docs.python.org/3/library/codecs.html#standard-encodings). They're inconsistent, but using names like ISOLatin1+ can greatly simplify the codes.

'ISOLatin1+'
>>> check_encoding("123AB中文") # Characters not in any charset encoding
'ISOLatin1+'
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the GMT-supported encodings, I'm thinking if we should return ascii (the name ascii comes from https://docs.python.org/3/library/codecs.html#standard-encodings) if the input string only contains ASCII characters, e.g.,

    if all(32 <= ord(c) <= 126 for c in argstr):
        return "ascii"

Since in most cases, the arguments and the text string contain ASCII characters only. When ascii encoding is detected, we no longer need to apply non_ascii_to_octal to the strings, which may improve the performance for most cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 2d01f6b.

@seisman seisman added the run/benchmark Trigger the benchmark workflow in PRs label Jul 7, 2024
Copy link

codspeed-hq bot commented Jul 7, 2024

CodSpeed Performance Report

Merging #3310 will not alter performance

Comparing iso-encoding (6bd5008) with main (318a8c4)

Summary

✅ 101 untouched benchmarks

@seisman
Copy link
Member Author

seisman commented Jul 17, 2024

Ping @GenericMappingTools/pygmt-maintainers for reviews.

@seisman seisman requested a review from a team July 17, 2024 06:15
@seisman seisman added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. run/benchmark Trigger the benchmark workflow in PRs labels Jul 23, 2024
@seisman
Copy link
Member Author

seisman commented Jul 23, 2024

I'll merge this PR in 24 hours if there are no further comments.

@@ -192,7 +192,58 @@ def data_kind(
return kind


def non_ascii_to_octal(argstr: str) -> str:
def check_encoding(argstr: str) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be too much to have a typehint like this?

Suggested change
def check_encoding(argstr: str) -> str:
def check_encoding(argstr: str) -> Literal[
"ascii",
"ISOLatin1+",
"ISO-8859-1",
"ISO-8859-2",
"ISO-8859-3",
"ISO-8859-4",
"ISO-8859-5",
"ISO-8859-6",
"ISO-8859-7",
"ISO-8859-8",
"ISO-8859-9",
"ISO-8859-10",
"ISO-8859-11",
"ISO-8859-13",
"ISO-8859-14",
"ISO-8859-15",
"ISO-8859-16",
"ISO-8859-17",
]:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about that too. Actually the same types can be used in typing hints the encoding parameter of the non_ascii_to_octal function. So my initial plan is to define a generic type

type Encodings = Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]

and then use it like:

def check_encoding(argstr: str) -> Encodings:
def non_ascii_to_octal(argstr: str, encoding: Encodings = "ISOLatin1+") -> str:

but mypy v1.10.0 complains

pygmt/helpers/utils.py:21: error: PEP 695 type aliases are not yet supported  [valid-type]

mypy v1.11.0 started to support PEP 695 but it was released just a few days ago (https://mypy-lang.blogspot.com/2024/07/mypy-111-released.html) and the PEP 695 support is still experimental.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7607c7e.

@@ -19,6 +19,7 @@
args_in_kwargs,
build_arg_list,
build_arg_string,
check_encoding,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using _check_encoding to keep this function more private? I know we don't document pygmt.helpers.utils in the API docs, but want to avoid users from thinking that this function is somewhat public if there's no leading underscore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6728856.

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @seisman, haven't checked too closely, but don't want to hold this up for too long. Made one comment about mixed encoding support in different arguments, but we can discuss that #2204 and decide whether to open a separate PR for it.

@@ -192,17 +192,113 @@ def data_kind(
return kind


def non_ascii_to_octal(argstr: str) -> str:
def _check_encoding(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this private function near the top (together with _validate_data_input)?

Comment on lines +433 to +436
# Convert non-ASCII characters (if any) in the arguments to octal codes
encoding = _check_encoding("".join(gmt_args))
if encoding != "ascii":
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block assumes that there will only be one encoding type? To support multiple encodings in different arguments, maybe change it to something like this (untested):

Suggested change
# Convert non-ASCII characters (if any) in the arguments to octal codes
encoding = _check_encoding("".join(gmt_args))
if encoding != "ascii":
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
# Convert non-ASCII characters (if any) in the arguments to octal codes
for i, arg in enumerate(gmt_args):
encoding = _check_encoding("".join(gmt_args))
if encoding != "ascii":
gmt_args[i] = non_ascii_to_octal(arg, encoding=encoding)

But this can be done in a separate PR perhaps. I'm not sure how common it is to mix encodings in different arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support multiple encodings in different arguments, maybe change it to something like this (untested):

It's not possible in GMT. For each GMT module call, we can only pass --PS_CHAR_ENCODING once, so it means that all the arguments in a GMT call must be in the same encoding.

For Figure.text, the text string and any arguments also must use the same encoding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If people use mixed encodings in different arguments, the default ISOLatin1+ encoding is used.

@seisman seisman merged commit 3502252 into main Jul 23, 2024
20 checks passed
@seisman seisman deleted the iso-encoding branch July 23, 2024 05:49
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Jul 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants