Support non-ASCII characters in ISO-8859-x charsets #3310

seisman · 2024-07-03T13:08:58Z

This PR enhances PyGMT to support non-ASCII characters in ISO-8859-x charsets (see https://en.wikipedia.org/wiki/ISO/IEC_8859-1, https://en.wikipedia.org/wiki/ISO/IEC_8859-2, and the pages for the other parts for the characters available in each encoding). Related to #2204.

In GMT, the default encoding is ISOLatin1+ (or Standard+, which PyGMT does not support yet). To use non-ASCII characters from an ISO-8859-x charset, we need to pass --PS_CHAR_ENCODING=<encoding>, as documented at https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING. Note that only one encoding can be used in a single GMT module call, so mixing characters from different encodings will not work. The implementation is therefore simple: check the argument string or Figure.text's text string, determine the charset encoding to use, and add --PS_CHAR_ENCODING=<encoding> when calling the module.
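As a rough illustration of the last step, the sketch below appends the option to a list of module arguments. The helper name `append_encoding_option` is hypothetical and not part of PyGMT, and the sketch assumes the option can simply be omitted for plain ASCII and for the default ISOLatin1+ encoding.

```python
def append_encoding_option(args: list[str], encoding: str) -> list[str]:
    """Return a copy of args, adding --PS_CHAR_ENCODING when it is needed.

    Hypothetical helper for illustration only; not PyGMT's actual code.
    """
    # Assumption: nothing needs to be passed for ASCII-only input or for
    # GMT's default encoding.
    if encoding in {"ascii", "ISOLatin1+"}:
        return list(args)
    return [*args, f"--PS_CHAR_ENCODING={encoding}"]


print(append_encoding_option(["-R0/10/0/10", "-JX10c"], "ISO-8859-4"))
```

Because the option is passed once per module call, every argument in that call has to share the detected encoding.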

This PR does a few things:

Add the mapping dictionaries of all ISO-8859-x charsets to pygmt/encodings.py
Add check_encoding to check the encoding of a string.
Add the new encoding parameter to non_ascii_to_octal so that we know the encoding to use when converting non-ASCII characters.
Update to make PyGMT arguments and Figure.text's text parameter support ISO-8859-x charsets.
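The detection step can be sketched roughly as follows. This is not the PR's implementation: instead of the mapping dictionaries in pygmt/encodings.py, it swaps in Python's built-in iso8859_x codecs, and the function name `guess_encoding` is hypothetical. It returns the first ISO-8859-x part that can represent every character and otherwise falls back to ISOLatin1+.

```python
# ISO-8859-12 was never published, so it is skipped.
ISO_PARTS = [i for i in range(1, 17) if i != 12]


def guess_encoding(text: str) -> str:
    """Guess a GMT-style encoding name for text (illustrative sketch only)."""
    # Printable ASCII needs no special encoding.
    if all(32 <= ord(c) <= 126 for c in text):
        return "ascii"
    for part in ISO_PARTS:
        try:
            # Python's codec stands in for PyGMT's charset dictionary here.
            text.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue
        return f"ISO-8859-{part}"
    # No single charset covers the string: fall back to GMT's default.
    return "ISOLatin1+"


print(guess_encoding("xytext:1éęëė2"))
```

Python's codecs and GMT's charset tables may not agree on every code point, so the result is only an approximation of what PyGMT would detect.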

Preview: https://pygmt-dev--3310.org.readthedocs.build/en/3310/techref/encodings.html

Here is an example showing how it works with non-ASCII characters in ISO-8859-4. The same script is added as a test.

import pygmt

fig = pygmt.Figure()
fig.basemap(region=[0, 10, 0, 10], projection="X10c", frame=["WSEN+tAāáâãäåB"])
fig.text(position="TL", text="position-text:1ÉĘËĖ2")
fig.text(x=1, y=1, text="xytext:1éęëė2")
fig.text(x=[5, 5], y=[3, 5], text=["xytext1:ųúûüũūαζ∆❡", "xytext2:íîī∑π∇✉"])
fig.show()

The output is

The script below shows all the characters in the different ISO-8859-x encodings. I haven't checked them all carefully, but most look correct. The exceptions are ISO-8859-6, ISO-8859-8, and ISO-8859-11, which don't work with the standard fonts (https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING). I think that's OK, because whoever wants to use these characters should have the required fonts installed. Some characters, like Ṁ in ISO-8859-14, are missing in the figure below, and they can't be shown correctly in the GMT CLI either. Looking at the charset definition files in the GMT source code, the character Ṁ has a PS name like /uni1E40 (https://github.com/GenericMappingTools/gmt/blob/0584d620d27de5ed93a90e4b0dc56a2edb2d568a/src/PSL_ISO-8859-14.h#L24C43-L24C43), which doesn't seem to be a valid PS name. So any remaining problems are most likely upstream issues.

import pygmt
from pygmt.encodings import charset

fig = pygmt.Figure()
with fig.subplot(ncols=5, nrows=3, subsize=("10c", "15c"), margins=0.5):
    # ISO-8859-12 does not exist, so skip it.
    for i in range(1, 17):
        if i == 12:
            continue
        encoding = f"ISO-8859-{i}"
        chars = "".join(charset[encoding].values())
        fig.basemap(region=[0, 16, 0, 13], projection="X?/-?", frame=[0, f"+t{encoding}"], panel=True)
        # Plot the characters in rows of 16.
        for j in range(0, len(chars), 16):
            fig.text(x=1, y=j // 16 + 1, text=chars[j : j + 16], font="20p", justify="ML")
fig.savefig("encodings.png")

github-actions · 2024-07-04T03:17:53Z

Summary of changed images

This is an auto-generated report of images that have changed on the DVC remote

Status	Path
added	pygmt/tests/baseline/test_text_nonascii_iso8859.png

Image diff(s)

Added images

test_text_nonascii_iso8859.png

Modified images

Path	Old	New

Report last updated at commit 3474bb2

doc/techref/encodings.md

Co-authored-by: Michael Grund <[email protected]>

examples/gallery/images/rgb_image.py

pygmt/encodings.py

pygmt/helpers/utils.py

Co-authored-by: Yvonne Fröhlich <[email protected]>

seisman · 2024-07-07T04:41:03Z

pygmt/helpers/utils.py

+            return encoding
+    # Return the "ISOLatin1+" encoding if the string contains characters from multiple
+    # charset encodings or contains characters that are not in any charset encoding.
+    return "ISOLatin1+"


The function returns ISOLatin1+/ISO-8859-1/.../ISO-8859-16, but Python's standard encodings are named latin_1/iso8859_1/.../iso8859_16 (https://docs.python.org/3/library/codecs.html#standard-encodings). The names are inconsistent, but using names like ISOLatin1+ greatly simplifies the code.
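For readers who need to bridge the two naming schemes, here is a minimal sketch of a translation helper. The name `gmt_to_python_codec` is hypothetical and not part of PyGMT. Mapping ISOLatin1+ to latin_1 is an approximation, since ISOLatin1+ is GMT's own extension of Latin-1.

```python
import codecs


def gmt_to_python_codec(name: str) -> str:
    """Translate a GMT-style encoding name to a Python codec name (sketch)."""
    if name == "ascii":
        return "ascii"
    if name == "ISOLatin1+":
        # Approximation: Python has no codec for GMT's ISOLatin1+ extension.
        return "latin_1"
    prefix = "ISO-8859-"
    if name.startswith(prefix) and name[len(prefix):].isdigit():
        return f"iso8859_{name[len(prefix):]}"
    raise ValueError(f"Unknown encoding name: {name!r}")


# codecs.lookup raises LookupError if Python does not know the codec.
codec_info = codecs.lookup(gmt_to_python_codec("ISO-8859-4"))
print(codec_info.name)
```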

seisman · 2024-07-07T04:42:08Z

pygmt/helpers/utils.py

+    'ISOLatin1+'
+    >>> check_encoding("123AB中文")  # Characters not in any charset encoding
+    'ISOLatin1+'
+    """


In addition to the GMT-supported encodings, I'm wondering whether we should return ascii (the name ascii comes from https://docs.python.org/3/library/codecs.html#standard-encodings) if the input string contains only ASCII characters, e.g.,

if all(32 <= ord(c) <= 126 for c in argstr): return "ascii"

In most cases, the arguments and the text string contain only ASCII characters. When the ascii encoding is detected, we no longer need to apply non_ascii_to_octal to the strings, which may improve performance in the common case.

Done in 2d01f6b.
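The fast path can be pictured with the sketch below. It is not the code from the commit: the function name `to_octal` is hypothetical, and the byte values come from Python's single-byte codecs rather than PyGMT's mapping dictionaries. ASCII-only strings are returned untouched, and every other character is replaced by a backslash followed by its three-digit octal byte value.

```python
def to_octal(text: str, codec: str = "latin_1") -> str:
    """Replace non-ASCII characters with octal escapes (illustrative sketch)."""
    # Fast path: nothing to convert for printable ASCII input.
    if all(32 <= ord(c) <= 126 for c in text):
        return text
    pieces = []
    for char in text:
        if 32 <= ord(char) <= 126:
            pieces.append(char)
        else:
            # Single-byte codecs yield exactly one byte per character;
            # unencodable characters raise UnicodeEncodeError.
            (byte,) = char.encode(codec)
            pieces.append(f"\\{byte:03o}")
    return "".join(pieces)


print(to_octal("abc"))
print(to_octal("Aé"))
```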

codspeed-hq · 2024-07-07T23:13:49Z

CodSpeed Performance Report

Merging #3310 will not alter performance

Comparing iso-encoding (6bd5008) with main (318a8c4)

Summary

✅ 101 untouched benchmarks


seisman · 2024-07-17T02:41:07Z

Ping @GenericMappingTools/pygmt-maintainers for reviews.

seisman · 2024-07-23T01:46:31Z

I'll merge this PR in 24 hours if there are no further comments.

weiji14 · 2024-07-23T03:18:01Z

pygmt/helpers/utils.py

@@ -192,7 +192,58 @@ def data_kind(
    return kind


-def non_ascii_to_octal(argstr: str) -> str:
+def check_encoding(argstr: str) -> str:


Would it be too much to have a typehint like this?

Suggested change

-def check_encoding(argstr: str) -> str:
+def check_encoding(argstr: str) -> Literal[
+    "ascii",
+    "ISOLatin1+",
+    "ISO-8859-1",
+    "ISO-8859-2",
+    "ISO-8859-3",
+    "ISO-8859-4",
+    "ISO-8859-5",
+    "ISO-8859-6",
+    "ISO-8859-7",
+    "ISO-8859-8",
+    "ISO-8859-9",
+    "ISO-8859-10",
+    "ISO-8859-11",
+    "ISO-8859-13",
+    "ISO-8859-14",
+    "ISO-8859-15",
+    "ISO-8859-16",
+]:

I'm thinking about that too. Actually, the same type can also be used in the type hint for the encoding parameter of the non_ascii_to_octal function. So my initial plan was to define a generic type

type Encodings = Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]

and then use it like:

def check_encoding(argstr: str) -> Encodings:
def non_ascii_to_octal(argstr: str, encoding: Encodings = "ISOLatin1+") -> str:

but mypy v1.10.0 complains

pygmt/helpers/utils.py:21: error: PEP 695 type aliases are not yet supported [valid-type]

mypy v1.11.0 started to support PEP 695 but it was released just a few days ago (https://mypy-lang.blogspot.com/2024/07/mypy-111-released.html) and the PEP 695 support is still experimental.

Done in 7607c7e.
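One way to share the type without the PEP 695 `type` statement is a plain assignment alias, sketched below. This is only an illustration and not necessarily what commit 7607c7e does; the alias name `Encoding` and the helper `is_supported_encoding` are hypothetical.

```python
from typing import Literal, get_args

# A plain assignment creates a type alias that older mypy versions accept.
Encoding = Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]


def is_supported_encoding(name: str) -> bool:
    """Check a name against the alias at runtime (hypothetical helper)."""
    return name in get_args(Encoding)


print(is_supported_encoding("ISO-8859-4"))
```

The same alias can then annotate both a return value and a parameter, which is the reuse described above.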

weiji14 · 2024-07-23T03:21:16Z

pygmt/helpers/__init__.py

@@ -19,6 +19,7 @@
    args_in_kwargs,
    build_arg_list,
    build_arg_string,
+    check_encoding,


Thoughts on using _check_encoding to keep this function more private? I know we don't document pygmt.helpers.utils in the API docs, but I want to avoid users thinking that this function is somewhat public because it has no leading underscore.

Done in 6728856.

weiji14

Thanks @seisman. I haven't checked too closely, but I don't want to hold this up for too long. I made one comment about mixed encoding support in different arguments, but we can discuss that in #2204 and decide whether to open a separate PR for it.

weiji14 · 2024-07-23T04:45:30Z

pygmt/helpers/utils.py

@@ -192,17 +192,113 @@ def data_kind(
    return kind


-def non_ascii_to_octal(argstr: str) -> str:
+def _check_encoding(


Move this private function near the top (together with _validate_data_input)?

weiji14 · 2024-07-23T04:54:14Z

pygmt/helpers/utils.py

+    # Convert non-ASCII characters (if any) in the arguments to octal codes
+    encoding = _check_encoding("".join(gmt_args))
+    if encoding != "ascii":
+        gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]


This block assumes that there will only be one encoding type? To support multiple encodings in different arguments, maybe change it to something like this (untested):

Suggested change

-    # Convert non-ASCII characters (if any) in the arguments to octal codes
-    encoding = _check_encoding("".join(gmt_args))
-    if encoding != "ascii":
-        gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
+    # Convert non-ASCII characters (if any) in the arguments to octal codes
+    for i, arg in enumerate(gmt_args):
+        encoding = _check_encoding(arg)
+        if encoding != "ascii":
+            gmt_args[i] = non_ascii_to_octal(arg, encoding=encoding)

But this can be done in a separate PR perhaps. I'm not sure how common it is to mix encodings in different arguments.

To support multiple encodings in different arguments, maybe change it to something like this (untested):

It's not possible in GMT. For each GMT module call, we can only pass --PS_CHAR_ENCODING once, so it means that all the arguments in a GMT call must be in the same encoding.

For Figure.text, the text string and any arguments also must use the same encoding.

If people use mixed encodings in different arguments, the default ISOLatin1+ encoding is used.
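The constraint can be illustrated by intersecting, over all arguments of one call, the sets of charsets that can represent each argument. The sketch below is an assumption-laden illustration, not PyGMT's code: it uses Python's iso8859_x codecs in place of PyGMT's charset dictionaries, ignores the ASCII fast path, and the names `encodable_parts` and `common_encoding` are hypothetical.

```python
# ISO-8859-12 was never published, so it is skipped.
ISO_PARTS = [i for i in range(1, 17) if i != 12]


def encodable_parts(text: str) -> set[int]:
    """Return the ISO-8859 parts whose Python codec can encode text."""
    parts = set()
    for part in ISO_PARTS:
        try:
            text.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue
        parts.add(part)
    return parts


def common_encoding(args: list[str]) -> str:
    """Pick one encoding shared by all arguments, else fall back (sketch)."""
    common = set(ISO_PARTS)
    for arg in args:
        common &= encodable_parts(arg)
    # An empty intersection means the arguments mix incompatible charsets.
    if not common:
        return "ISOLatin1+"
    return f"ISO-8859-{min(common)}"


print(common_encoding(["+té", "ë"]))
print(common_encoding(["+tę", "ж"]))
```

When the intersection is empty, no single --PS_CHAR_ENCODING value covers the whole call, which is why the default is used in that case.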

seisman changed the title ~~Support non-ASCII characters in ISO-8859-x charsets~~ WIP: Support non-ASCII characters in ISO-8859-x charsets Jul 3, 2024

seisman added 6 commits July 4, 2024 00:17

Add the mapping dictionary for all ISO-8859-x encodings

3bcf57f

Add a function to check the encoding of a string

4738815

Add the encoding parameter to non_ascii_to_octal

9f0e0f1

Improve build_arg_list to make it support ISO-8859-x encodings

21e91f1

Let Figure.text to support ISO-8859-x encodings

01ef6b3

Update examples/gallery/images/rgb_image.py

3c8b979

seisman force-pushed the iso-encoding branch from 8ac16c1 to 3c8b979 Compare July 3, 2024 16:54

seisman added the feature label Jul 3, 2024

seisman added this to the 0.13.0 milestone Jul 3, 2024

seisman added 4 commits July 4, 2024 10:14

Add doctests for check_encoding

d946636

Add doctest for non_ascii_to_octal

7aa07a0

Add doctests to build_arg_list

6f3aae4

Add a test for ISO-8859-x characters

e2fa2c8

seisman added 2 commits July 4, 2024 14:23

Merge branch 'main' into iso-encoding

dca5079

Update documentation

32a6646

michaelgrund reviewed Jul 4, 2024


doc/techref/encodings.md (outdated, resolved)

Update doc/techref/encodings.md

db590a3

Co-authored-by: Michael Grund <[email protected]>

seisman commented Jul 4, 2024


examples/gallery/images/rgb_image.py (outdated, resolved)

Revert changes in examples/gallery/images/rgb_image.py

7c9bed4

seisman changed the title ~~WIP: Support non-ASCII characters in ISO-8859-x charsets~~ Support non-ASCII characters in ISO-8859-x charsets Jul 4, 2024

seisman added the enhancement and needs review labels and removed the feature label Jul 4, 2024

seisman marked this pull request as ready for review July 4, 2024 08:29

seisman added 2 commits July 4, 2024 16:33

Fix links

91818e1

Merge remote-tracking branch 'origin/iso-encoding' into iso-encoding

9af4efb

yvonnefroehlich reviewed Jul 4, 2024


pygmt/encodings.py (outdated, resolved)

pygmt/encodings.py (outdated, resolved)

pygmt/helpers/utils.py (outdated, resolved)

Apply suggestions from code review

4734520

Co-authored-by: Yvonne Fröhlich <[email protected]>

seisman commented Jul 7, 2024


seisman added the run/benchmark label Jul 7, 2024

Merge branch 'main' into iso-encoding

78cc52b

seisman added 5 commits July 8, 2024 11:29

Check_encoding now returns 'ascii' if the string only contains ASCII characters

2d01f6b

Update the docstrings in Figure.text

43ef0a2

Merge branch 'main' into iso-encoding

586127d

Merge branch 'main' into iso-encoding

e8ac6bb

Merge branch 'main' into iso-encoding

6bd5008

seisman requested a review from a team July 17, 2024 06:15

seisman added the final review call label and removed the needs review and run/benchmark labels Jul 23, 2024

seisman added 3 commits July 23, 2024 09:46

Merge branch 'main' into iso-encoding

5260749

Update docstrings

556fee9

Fix a docstring

c414b4c

weiji14 reviewed Jul 23, 2024


seisman added 3 commits July 23, 2024 12:22

Improve docstrings of check_encoding and non_ascii_to_octal

7607c7e

Make check_encoding function private

6728856

non_ascii_to_otcal: return immediately if encoding is ascii

7a26bfc

weiji14 approved these changes Jul 23, 2024


seisman added 4 commits July 23, 2024 12:59

Improve Figure.text docstrings

1636453

Improve docs

d90dcc8

Move private function _check_encoding to the top

23a2806

Silent a mypy warning

3474bb2

seisman merged commit 3502252 into main Jul 23, 2024
20 checks passed

seisman deleted the iso-encoding branch July 23, 2024 05:49

seisman removed the final review call label Jul 23, 2024
