Support non-ASCII characters in ISO-8859-x charsets #3310

Merged
merged 35 commits into from
Jul 23, 2024
Changes from 13 commits
3bcf57f
Add the mapping dictionary for all ISO-8859-x encodings
seisman Jul 3, 2024
4738815
Add a function to check the encoding of a string
seisman Jul 3, 2024
9f0e0f1
Add the encoding parameter to non_ascii_to_octal
seisman Jul 3, 2024
21e91f1
Improve build_arg_list to make it support ISO-8859-x encodings
seisman Jul 3, 2024
01ef6b3
Let Figure.text to support ISO-8859-x encodings
seisman Jul 3, 2024
3c8b979
Update examples/gallery/images/rgb_image.py
seisman Jul 3, 2024
d946636
Add doctests for check_encoding
seisman Jul 4, 2024
7aa07a0
Add doctest for non_ascii_to_octal
seisman Jul 4, 2024
6f3aae4
Add doctests to build_arg_list
seisman Jul 4, 2024
e2fa2c8
Add a test for ISO-8859-x characters
seisman Jul 4, 2024
dca5079
Merge branch 'main' into iso-encoding
seisman Jul 4, 2024
32a6646
Update documentation
seisman Jul 4, 2024
db590a3
Update doc/techref/encodings.md
seisman Jul 4, 2024
7c9bed4
Revert changes in examples/gallery/images/rgb_image.py
seisman Jul 4, 2024
91818e1
Fix links
seisman Jul 4, 2024
9af4efb
Merge remote-tracking branch 'origin/iso-encoding' into iso-encoding
seisman Jul 4, 2024
4734520
Apply suggestions from code review
seisman Jul 5, 2024
df3679b
Merge branch 'main' into iso-encoding
seisman Jul 5, 2024
36b2a2f
Fix a markdown link
seisman Jul 5, 2024
78cc52b
Merge branch 'main' into iso-encoding
seisman Jul 7, 2024
2d01f6b
Check_encoding now returns 'ascii' if the string only contains ASCII …
seisman Jul 8, 2024
43ef0a2
Update the docstrings in Figure.text
seisman Jul 8, 2024
586127d
Merge branch 'main' into iso-encoding
seisman Jul 9, 2024
e8ac6bb
Merge branch 'main' into iso-encoding
seisman Jul 11, 2024
6bd5008
Merge branch 'main' into iso-encoding
seisman Jul 17, 2024
5260749
Merge branch 'main' into iso-encoding
seisman Jul 23, 2024
556fee9
Update docstrings
seisman Jul 23, 2024
c414b4c
Fix a docstring
seisman Jul 23, 2024
7607c7e
Improve docstrings of check_encoding and non_ascii_to_octal
seisman Jul 23, 2024
6728856
Make check_encoding function private
seisman Jul 23, 2024
7a26bfc
non_ascii_to_otcal: return immediately if encoding is ascii
seisman Jul 23, 2024
1636453
Improve Figure.text docstrings
seisman Jul 23, 2024
d90dcc8
Improve docs
seisman Jul 23, 2024
23a2806
Move private function _check_encoding to the top
seisman Jul 23, 2024
3474bb2
Silent a mypy warning
seisman Jul 23, 2024
23 changes: 23 additions & 0 deletions doc/techref/encodings.md
@@ -106,3 +106,26 @@ the Unicode character set.
| **\35x** | ➨ | ➩ | ➪ | ➫ | ➬ | ➭ | ➮ | ➯ |
| **\36x** | � | ➱ | ➲ | ➳ | ➴ | ➵ | ➶ | ➷ |
| **\37x** | ➸ | ➹ | ➺ | ➻ | ➼ | ➽ | ➾ | � |

## ISO/IEC 8859

GMT also supports the ISO/IEC 8859 standard for 8-bit character encodings. Refer to
https://en.wikipedia.org/wiki/ISO/IEC_8859 for descriptions of the different parts of the standard.

For a list of the characters in each part of the standard, refer to the following links:

- https://en.wikipedia.org/wiki/ISO/IEC_8859-1
- https://en.wikipedia.org/wiki/ISO/IEC_8859-2
- https://en.wikipedia.org/wiki/ISO/IEC_8859-3
- https://en.wikipedia.org/wiki/ISO/IEC_8859-4
- https://en.wikipedia.org/wiki/ISO/IEC_8859-5
- https://en.wikipedia.org/wiki/ISO/IEC_8859-6
- https://en.wikipedia.org/wiki/ISO/IEC_8859-7
- https://en.wikipedia.org/wiki/ISO/IEC_8859-8
- https://en.wikipedia.org/wiki/ISO/IEC_8859-9
- https://en.wikipedia.org/wiki/ISO/IEC_8859-10
- https://en.wikipedia.org/wiki/ISO/IEC_8859-11
- https://en.wikipedia.org/wiki/ISO/IEC_8859-13
- https://en.wikipedia.org/wiki/ISO/IEC_8859-14
- https://en.wikipedia.org/wiki/ISO/IEC_8859-15
- https://en.wikipedia.org/wiki/ISO/IEC_8859-16
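
As a rough illustration of how the parts differ, the same octal code can be decoded under several parts with Python's built-in `iso8859_N` codecs. This is a sketch rather than code from this PR, and `decode_octal` is a hypothetical helper name:

```python
def decode_octal(code: int, part: int) -> str:
    """Decode one byte value under ISO/IEC 8859-<part> with Python's built-in codec."""
    return bytes([code]).decode(f"iso8859_{part}")


# Octal 340 (0xE0) is a different character in each of these parts.
for part in (1, 2, 4, 5, 7):
    print(f"ISO-8859-{part}: \\340 -> {decode_octal(0o340, part)}")
```

This is why the charset has to be known before a non-ASCII character can be turned into an octal code.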
2 changes: 1 addition & 1 deletion examples/gallery/images/rgb_image.py
@@ -38,6 +38,6 @@
grid=image,
# Use a map scale where 1 cm on the map equals 1 km on the ground
projection="x1:100000",
-frame=[r"WSne+tL@!a¯hain@!a¯, Hawai`i on 9 Aug 2023", "af"],
+frame=[r"WSne+tLāhainā, Hawai`i on 9 Aug 2023", "af"],
)
fig.show()
20 changes: 16 additions & 4 deletions pygmt/encodings.py
@@ -1,11 +1,11 @@
"""
-Adobe character encodings supported by GMT.
+Character encodings supported by GMT.

-Currently, only Adobe Symbol, Adobe ZapfDingbats, and Adobe ISOLatin1+ encodings are
-supported.
+Currently, Adobe Symbol, Adobe ZapfDingbats, Adobe ISOLatin1+ and ISO-8859-x (x can be
+1-11, 13-16) encodings are supported. Adobe Standard+ encoding is not supported.

The corresponding Unicode characters in each Adobe character encoding are generated
-from the mapping table and conversion script in the GMT-octal-codes
+from the mapping tables and conversion scripts in the GMT-octal-codes
(https://github.com/seisman/GMT-octal-codes) repository. Refer to that repository for
details.

@@ -22,8 +22,11 @@
- Adobe Symbol: https://en.wikipedia.org/wiki/Symbol_(typeface)
- Zapf Dingbats: https://en.wikipedia.org/wiki/Zapf_Dingbats
- Adobe Glyph List: https://github.com/adobe-type-tools/agl-aglfn
- ISO-8859-x: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
"""

import codecs

# Dictionary of character mappings for different encodings.
charset: dict = {}

@@ -129,3 +132,12 @@
strict=False,
)
)

# ISO-8859-x charsets and x can be 1-11, 13-16.
for i in range(1, 17):
    if i == 12:  # ISO-8859-12 was abandoned.
        continue
    charset[f"ISO-8859-{i}"] = {
        code: codecs.decode(bytes([code]), f"iso8859_{i}", errors="replace")
        for code in [*range(0o040, 0o200), *range(0o240, 0o400)]
    }
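
The loop above can be tried on its own. The following is a minimal standalone sketch of the same idea; `build_iso8859_charsets` is a hypothetical wrapper name, and the Adobe tables from the real module are left out:

```python
import codecs


def build_iso8859_charsets() -> dict:
    """Map each ISO-8859-x name to a {code: character} table."""
    charsets = {}
    for part in range(1, 17):
        if part == 12:  # ISO-8859-12 was abandoned.
            continue
        charsets[f"ISO-8859-{part}"] = {
            # errors="replace" turns undefined code points into U+FFFD.
            code: codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
            for code in [*range(0o040, 0o200), *range(0o240, 0o400)]
        }
    return charsets


charsets = build_iso8859_charsets()
print(len(charsets), "charsets,", len(charsets["ISO-8859-4"]), "codes each")
print(f"\\{0o340:03o} -> {charsets['ISO-8859-4'][0o340]}")
```

Each table covers octal 040-177 and 240-377, so the C0 and C1 control ranges are skipped.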
1 change: 1 addition & 0 deletions pygmt/helpers/__init__.py
@@ -18,6 +18,7 @@
args_in_kwargs,
build_arg_list,
build_arg_string,
check_encoding,
Member

Thoughts on using _check_encoding to keep this function more private? I know we don't document pygmt.helpers.utils in the API docs, but I want to avoid users thinking that this function is somewhat public because it has no leading underscore.

Member Author

Done in 6728856.

data_kind,
is_nonstr_iter,
launch_external_viewer,
75 changes: 69 additions & 6 deletions pygmt/helpers/utils.py
@@ -205,7 +205,55 @@ def data_kind(data=None, x=None, y=None, z=None, required_z=False, required_data
return kind


-def non_ascii_to_octal(argstr: str) -> str:
+def check_encoding(argstr: str) -> str:
Member

Would it be too much to have a typehint like this?

Suggested change
def check_encoding(argstr: str) -> str:
def check_encoding(argstr: str) -> Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]:

Member Author

I'm thinking about that too. Actually, the same type can be used in the type hint for the encoding parameter of the non_ascii_to_octal function. So my initial plan was to define a generic type

type Encodings = Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]

and then use it like:

def check_encoding(argstr: str) -> Encodings:
def non_ascii_to_octal(argstr: str, encoding: Encodings = "ISOLatin1+") -> str:

but mypy v1.10.0 complains

pygmt/helpers/utils.py:21: error: PEP 695 type aliases are not yet supported  [valid-type]

mypy v1.11.0 started to support PEP 695 but it was released just a few days ago (https://mypy-lang.blogspot.com/2024/07/mypy-111-released.html) and the PEP 695 support is still experimental.

Member Author

Done in 7607c7e.
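
One way around the PEP 695 limitation mentioned above is a plain assignment alias, which type checkers accepted well before the `type` statement existed. The sketch below is only an assumption about how such an alias could look, not necessarily what the referenced commit does; `Encoding` and `is_supported` are illustrative names.

```python
from typing import Literal, get_args

# Plain assignment alias; no PEP 695 "type" statement needed.
Encoding = Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]


def is_supported(name: str) -> bool:
    """Runtime check against the names listed in the alias."""
    return name in get_args(Encoding)


print(is_supported("ISO-8859-4"), is_supported("ISO-8859-12"))
```

The same alias can then annotate both the return value of the check function and the `encoding` parameter of the conversion function.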

"""
Check the charset encoding of a string.

All characters in the string must be in a single charset encoding, otherwise the
default ISOLatin1+ encoding is returned. Characters in the Symbol and ZapfDingbats
fonts are also checked because they're independent of the charset setting.

Parameters
----------
argstr
The string to be checked.

Returns
-------
encoding
The encoding of the string.

Examples
--------
>>> check_encoding("123ABC+-?!") # ASCII characters only
'ISOLatin1+'
>>> check_encoding("12AB±β①②") # Characters in ISOLatin1+
'ISOLatin1+'
>>> check_encoding("12ABāáâãäåβ①②") # Characters in ISO-8859-4
'ISO-8859-4'
>>> check_encoding("12ABŒā") # Mix characters in ISOLatin1+ (Œ) and ISO-8859-4 (ā)
'ISOLatin1+'
>>> check_encoding("123AB中文") # Characters not in any charset encoding
'ISOLatin1+'
"""
Member Author

In addition to the GMT-supported encodings, I'm wondering whether we should return ascii (the name ascii comes from https://docs.python.org/3/library/codecs.html#standard-encodings) if the input string contains only ASCII characters, e.g.,

    if all(32 <= ord(c) <= 126 for c in argstr):
        return "ascii"

In most cases, the arguments and the text strings contain ASCII characters only. When the ascii encoding is detected, we no longer need to apply non_ascii_to_octal to the strings, which may improve performance.

Member Author

Done in 2d01f6b.

    # Loop through all supported encodings and check if all characters in the string
    # are in the charset of the encoding. If all characters are in the charset, return
    # the encoding. The ISOLatin1+ encoding is checked first because it is the default
    # and most common encoding.
    adobe_chars = set(charset["Symbol"].values()) | set(
        charset["ZapfDingbats"].values()
    )
    for encoding in ["ISOLatin1+"] + [f"ISO-8859-{i}" for i in range(1, 17)]:
        if encoding == "ISO-8859-12":  # ISO-8859-12 was abandoned. Skip it.
            continue
        if all(c in (set(charset[encoding].values()) | adobe_chars) for c in argstr):
            return encoding
    # Return the "ISOLatin1+" encoding if the string contains characters from multiple
    # charset encodings or contains characters that are not in any charset encoding.
    return "ISOLatin1+"
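
The detection loop can be sketched in a self-contained form, including the printable-ASCII shortcut suggested in the review comment above. This is a simplified illustration under stated assumptions: `detect_encoding` is a hypothetical name, and the Adobe ISOLatin1+, Symbol and ZapfDingbats tables are omitted, so only the ISO-8859-x parts are tried and "ISOLatin1+" appears only as the fallback label.

```python
import codecs

# Character sets for ISO-8859-1 ... ISO-8859-16 (part 12 was abandoned).
CHARSETS = {
    f"ISO-8859-{part}": {
        codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
        for code in [*range(0o040, 0o200), *range(0o240, 0o400)]
    }
    for part in range(1, 17)
    if part != 12
}


def detect_encoding(text: str, default: str = "ISOLatin1+") -> str:
    """Return the first charset that contains every character of the string."""
    # Shortcut: printable ASCII only, so no octal conversion is needed later.
    if all(32 <= ord(c) <= 126 for c in text):
        return "ascii"
    for name, chars in CHARSETS.items():
        if all(c in chars for c in text):
            return name
    # Mixed or unsupported characters fall back to the default.
    return default


print(detect_encoding("123ABC+-?!"))
print(detect_encoding("12ABāáâãäå"))
print(detect_encoding("123AB中文"))
```

Because the parts are tried in order, a string whose characters exist in several parts is assigned to the lowest-numbered one.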
Member Author

The function returns ISOLatin1+/ISO-8859-1/.../ISO-8859-16, but Python's standard encodings are latin_1/iso8859_1/.../iso8859_16 (https://docs.python.org/3/library/codecs.html#standard-encodings). The names are inconsistent, but using names like ISOLatin1+ greatly simplifies the code.
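
If a Python codec name is ever needed, the translation between the two naming schemes is mechanical for the ISO-8859-x parts. The helper below is a hypothetical sketch and not part of this PR; ISOLatin1+ is an Adobe encoding, so the sketch rejects it instead of guessing at a codec.

```python
def python_codec_name(gmt_name: str) -> str:
    """Translate a GMT-style name such as "ISO-8859-4" to "iso8859_4"."""
    prefix = "ISO-8859-"
    if not gmt_name.startswith(prefix):
        raise ValueError(f"No Python codec corresponds to {gmt_name!r}.")
    return f"iso8859_{gmt_name[len(prefix):]}"


print(python_codec_name("ISO-8859-4"))
print("ā".encode(python_codec_name("ISO-8859-4")))
```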



def non_ascii_to_octal(argstr: str, encoding: str = "ISOLatin1+") -> str:
r"""
Translate non-ASCII characters to their corresponding octal codes.

@@ -216,6 +264,8 @@ def non_ascii_to_octal(argstr: str) -> str:
----------
argstr
The string to be translated.
encoding
The encoding of characters in the string.

Returns
-------
@@ -232,6 +282,8 @@ def non_ascii_to_octal(argstr: str) -> str:
'@%34%\\041@%%@%34%\\176@%%@%34%\\241@%%@%34%\\376@%%'
>>> non_ascii_to_octal("ABC ±120° DEF α ♥")
'ABC \\261120\\260 DEF @~\\141@~ @%34%\\252@%%'
>>> non_ascii_to_octal("12ABāáâãäåβ①②", encoding="ISO-8859-4")
'12AB\\340\\341\\342\\343\\344\\345@~\\142@~@%34%\\254@%%@%34%\\255@%%'
""" # noqa: RUF002
# Return the string if it only contains printable ASCII characters from 32 to 126.
if all(32 <= ord(c) <= 126 for c in argstr):
@@ -245,8 +297,8 @@ def non_ascii_to_octal(argstr: str) -> str:
mapping.update(
{c: f"@%34%\\{i:03o}@%%" for i, c in charset["ZapfDingbats"].items()}
)
-# Adobe ISOLatin1+ charset. Put at the end.
-mapping.update({c: f"\\{i:03o}" for i, c in charset["ISOLatin1+"].items()})
+# ISOLatin1+ or ISO-8859-x charset.
+mapping.update({c: f"\\{i:03o}" for i, c in charset[encoding].items()})

# Remove any printable characters
mapping = {k: v for k, v in mapping.items() if k not in string.printable}
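
The effect of the `encoding` parameter can be reproduced with a reduced sketch that handles only the upper half of one ISO-8859-x part. `to_octal` is a hypothetical stand-in for the real function, and the Symbol and ZapfDingbats mappings are deliberately left out.

```python
import codecs


def to_octal(text: str, encoding: str = "ISO-8859-4") -> str:
    """Replace characters from the upper half of an ISO-8859-x part with octal codes."""
    codec = "iso8859_" + encoding.rsplit("-", 1)[1]
    mapping = {}
    for code in range(0o240, 0o400):
        char = codecs.decode(bytes([code]), codec, errors="replace")
        if char != "\ufffd":  # Skip code points the part leaves undefined.
            mapping[char] = f"\\{code:03o}"
    # Characters without an entry (e.g. printable ASCII) pass through unchanged.
    return text.translate(str.maketrans(mapping))


print(to_octal("12ABāáâãäå", encoding="ISO-8859-4"))
```

The octal codes only render correctly when GMT is told to use the same charset through PS_CHAR_ENCODING.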
@@ -323,6 +375,10 @@ def build_arg_list(
... )
... )
['f1.txt', 'f2.txt', '-A0', '-B', '--FORMAT_DATE_MAP=o dd', '->out.txt']
>>> build_arg_list(dict(B="12ABāβ①②"))
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-4']
>>> build_arg_list(dict(B="12ABāβ①②"), confdict=dict(PS_CHAR_ENCODING="ISO-8859-5"))
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-5']
>>> print(build_arg_list(dict(R="1/2/3/4", J="X4i", watre=True)))
Traceback (most recent call last):
...
@@ -337,10 +393,17 @@ def build_arg_list(
         elif value is True:
             gmt_args.append(f"-{key}")
         elif is_nonstr_iter(value):
-            gmt_args.extend(non_ascii_to_octal(f"-{key}{_value}") for _value in value)
+            gmt_args.extend(f"-{key}{_value}" for _value in value)
         else:
-            gmt_args.append(non_ascii_to_octal(f"-{key}{value}"))
-    gmt_args = sorted(gmt_args)
+            gmt_args.append(f"-{key}{value}")
+
+    # Convert non-ASCII characters (if any) in the arguments to octal codes
+    encoding = check_encoding(" ".join(gmt_args))
+    gmt_args = sorted([non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args])
+
+    # Set --PS_CHAR_ENCODING=encoding if necessary
+    if encoding != "ISOLatin1+" and not (confdict and "PS_CHAR_ENCODING" in confdict):
+        gmt_args.append(f"--PS_CHAR_ENCODING={encoding}")

     if confdict:
         gmt_args.extend(f"--{key}={value}" for key, value in confdict.items())
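
The rule for when `--PS_CHAR_ENCODING` is appended can be isolated in a small sketch. `config_args` is a hypothetical helper that mirrors only the last two steps of the hunk above: a user-supplied PS_CHAR_ENCODING in `confdict` takes precedence over the detected encoding, and nothing is added for the default ISOLatin1+.

```python
from typing import Optional


def config_args(encoding: str, confdict: Optional[dict] = None) -> list:
    """Build the trailing --KEY=value arguments for a detected encoding."""
    args = []
    # Only set the encoding when it is not the default and the user has not set it.
    if encoding != "ISOLatin1+" and not (confdict and "PS_CHAR_ENCODING" in confdict):
        args.append(f"--PS_CHAR_ENCODING={encoding}")
    if confdict:
        args.extend(f"--{key}={value}" for key, value in confdict.items())
    return args


print(config_args("ISO-8859-4"))
print(config_args("ISO-8859-4", {"PS_CHAR_ENCODING": "ISO-8859-5"}))
print(config_args("ISOLatin1+"))
```

This matches the two doctests added to build_arg_list, where the explicit ISO-8859-5 setting wins even though the string was detected as ISO-8859-4.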
15 changes: 13 additions & 2 deletions pygmt/src/text.py
@@ -7,6 +7,7 @@
from pygmt.exceptions import GMTInvalidInput
from pygmt.helpers import (
build_arg_list,
check_encoding,
data_kind,
fmt_docstring,
is_nonstr_iter,
@@ -226,13 +227,23 @@ def text_( # noqa: PLR0912
kwargs["t"] = ""

     # Append text at last column. Text must be passed in as str type.
+    confdict = {}
     if kind == "vectors":
+        text = np.atleast_1d(text).astype(str)
+        encoding = check_encoding("".join(text))
         extra_arrays.append(
-            np.vectorize(non_ascii_to_octal)(np.atleast_1d(text).astype(str))
+            np.vectorize(non_ascii_to_octal, excluded="encoding")(
+                text, encoding=encoding
+            )
         )
+        if encoding != "ISOLatin1+":
+            confdict = {"PS_CHAR_ENCODING": encoding}

     with Session() as lib:
         with lib.virtualfile_in(
             check_kind="vector", data=textfiles, x=x, y=y, extra_arrays=extra_arrays
         ) as vintbl:
-            lib.call_module(module="text", args=build_arg_list(kwargs, infile=vintbl))
+            lib.call_module(
+                module="text",
+                args=build_arg_list(kwargs, infile=vintbl, confdict=confdict),
+            )
5 changes: 5 additions & 0 deletions pygmt/tests/baseline/test_text_nonascii_iso8859.png.dvc
@@ -0,0 +1,5 @@
outs:
- md5: a0f35a1d58c95e6589c7397e7660e946
size: 17089
hash: md5
path: test_text_nonascii_iso8859.png
13 changes: 13 additions & 0 deletions pygmt/tests/test_text.py
@@ -432,3 +432,16 @@ def test_text_quotation_marks():
fig.basemap(projection="X4c/2c", region=[0, 4, 0, 2], frame=0)
fig.text(x=2, y=1, text='\\234 ‘ ’ " “ ”', font="20p") # noqa: RUF001
return fig


@pytest.mark.mpl_image_compare
def test_text_nonascii_iso8859():
    """
    Test passing text strings with non-ascii characters in ISO-8859-4 encoding.
    """
    fig = Figure()
    fig.basemap(region=[0, 10, 0, 10], projection="X10c", frame=["WSEN+tAāáâãäåB"])
    fig.text(position="TL", text="position-text:1ÉĘËĖ2")
    fig.text(x=1, y=1, text="xytext:1éęëė2")
    fig.text(x=[5, 5], y=[3, 5], text=["xytext1:ųúûüũūαζ∆❡", "xytext2:íîī∑π∇✉"])
    return fig