Support non-ASCII characters in ISO-8859-x charsets #3310

Merged
merged 35 commits into from
Jul 23, 2024
35 commits
3bcf57f
Add the mapping dictionary for all ISO-8859-x encodings
seisman Jul 3, 2024
4738815
Add a function to check the encoding of a string
seisman Jul 3, 2024
9f0e0f1
Add the encoding parameter to non_ascii_to_octal
seisman Jul 3, 2024
21e91f1
Improve build_arg_list to make it support ISO-8859-x encodings
seisman Jul 3, 2024
01ef6b3
Let Figure.text to support ISO-8859-x encodings
seisman Jul 3, 2024
3c8b979
Update examples/gallery/images/rgb_image.py
seisman Jul 3, 2024
d946636
Add doctests for check_encoding
seisman Jul 4, 2024
7aa07a0
Add doctest for non_ascii_to_octal
seisman Jul 4, 2024
6f3aae4
Add doctests to build_arg_list
seisman Jul 4, 2024
e2fa2c8
Add a test for ISO-8859-x characters
seisman Jul 4, 2024
dca5079
Merge branch 'main' into iso-encoding
seisman Jul 4, 2024
32a6646
Update documentation
seisman Jul 4, 2024
db590a3
Update doc/techref/encodings.md
seisman Jul 4, 2024
7c9bed4
Revert changes in examples/gallery/images/rgb_image.py
seisman Jul 4, 2024
91818e1
Fix links
seisman Jul 4, 2024
9af4efb
Merge remote-tracking branch 'origin/iso-encoding' into iso-encoding
seisman Jul 4, 2024
4734520
Apply suggestions from code review
seisman Jul 5, 2024
df3679b
Merge branch 'main' into iso-encoding
seisman Jul 5, 2024
36b2a2f
Fix a markdown link
seisman Jul 5, 2024
78cc52b
Merge branch 'main' into iso-encoding
seisman Jul 7, 2024
2d01f6b
Check_encoding now returns 'ascii' if the string only contains ASCII …
seisman Jul 8, 2024
43ef0a2
Update the docstrings in Figure.text
seisman Jul 8, 2024
586127d
Merge branch 'main' into iso-encoding
seisman Jul 9, 2024
e8ac6bb
Merge branch 'main' into iso-encoding
seisman Jul 11, 2024
6bd5008
Merge branch 'main' into iso-encoding
seisman Jul 17, 2024
5260749
Merge branch 'main' into iso-encoding
seisman Jul 23, 2024
556fee9
Update docstrings
seisman Jul 23, 2024
c414b4c
Fix a docstring
seisman Jul 23, 2024
7607c7e
Improve docstrings of check_encoding and non_ascii_to_octal
seisman Jul 23, 2024
6728856
Make check_encoding function private
seisman Jul 23, 2024
7a26bfc
non_ascii_to_otcal: return immediately if encoding is ascii
seisman Jul 23, 2024
1636453
Improve Figure.text docstrings
seisman Jul 23, 2024
d90dcc8
Improve docs
seisman Jul 23, 2024
23a2806
Move private function _check_encoding to the top
seisman Jul 23, 2024
3474bb2
Silent a mypy warning
seisman Jul 23, 2024
38 changes: 30 additions & 8 deletions doc/techref/encodings.md
@@ -1,14 +1,12 @@
# Supported Encodings and Non-ASCII Characters

GMT supports a number of encodings and each encoding contains a set of ASCII and non-ASCII
characters. Below are some of the most common encodings and characters that are supported.
GMT supports a number of encodings and each encoding contains a set of ASCII and
non-ASCII characters. In PyGMT, you can use any of these ASCII and non-ASCII characters
in arguments and text strings. When using non-ASCII characters in PyGMT, the easiest way
is to copy and paste the character from the encoding tables below.

In PyGMT, you can use any of these ASCII and non-ASCII characters in arguments and text
strings. When using non-ASCII characters in PyGMT, the easiest way is to copy and paste
the character from the tables below.

**Note**: The special character � (REPLACEMENT CHARACTER) is used to indicate that
the character is not defined in the encoding.
**Note**: The special character � (REPLACEMENT CHARACTER) is used to indicate
that the character is not defined in the encoding.

## Adobe ISOLatin1+ Encoding

@@ -106,3 +104,27 @@ the Unicode character set.
| **\35x** | ➨ | ➩ | ➪ | ➫ | ➬ | ➭ | ➮ | ➯ |
| **\36x** | � | ➱ | ➲ | ➳ | ➴ | ➵ | ➶ | ➷ |
| **\37x** | ➸ | ➹ | ➺ | ➻ | ➼ | ➽ | ➾ | � |

## ISO/IEC 8859

GMT also supports the ISO/IEC 8859 standard for 8-bit character encodings. Refer to
<https://en.wikipedia.org/wiki/ISO/IEC_8859> for descriptions of the different parts of
the standard.

For a list of the characters in each part of the standard, refer to the following links:

- <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-2>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-3>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-4>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-5>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-6>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-7>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-8>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-9>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-10>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-11>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-13>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-14>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
- <https://en.wikipedia.org/wiki/ISO/IEC_8859-16>
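
The octal code of a character in a given part can also be derived programmatically instead of being read off the linked tables. The following is a minimal sketch that relies only on Python's standard codec names (`iso8859_1` to `iso8859_16`); the helper `octal_row` is hypothetical and not part of PyGMT. It formats eight consecutive code points in the same row layout as the tables in this document.

```python
def octal_row(part: int, high_digits: int) -> str:
    """Format eight characters of ISO-8859-<part> as one Markdown table row.

    ``high_digits`` holds the two leading octal digits, e.g. 0o34 for the
    octal codes 340 to 347.
    """
    cells = []
    for low in range(8):
        code = high_digits * 8 + low
        # Code points that are undefined in this part decode to U+FFFD.
        cells.append(bytes([code]).decode(f"iso8859_{part}", errors="replace"))
    return f"| **\\{high_digits:02o}x** | " + " | ".join(cells) + " |"


# Row \34x of ISO-8859-4 (hypothetical usage).
print(octal_row(4, 0o34))
```

The row label uses the same `\NNx` notation as the Adobe tables, so the output can be compared directly with them.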
32 changes: 22 additions & 10 deletions pygmt/encodings.py
@@ -1,13 +1,13 @@
"""
Adobe character encodings supported by GMT.
Character encodings supported by GMT.

Currently, only Adobe Symbol, Adobe ZapfDingbats, and Adobe ISOLatin1+ encodings are
supported.
Currently, Adobe Symbol, Adobe ZapfDingbats, Adobe ISOLatin1+ and ISO-8859-x (x can be
1-11, 13-16) encodings are supported. Adobe Standard encoding is not supported.

The corresponding Unicode characters in each Adobe character encoding are generated
from the mapping table and conversion script in the GMT-octal-codes
(https://github.com/seisman/GMT-octal-codes) repository. Refer to that repository for
details.
The corresponding Unicode characters in each Adobe character encoding are generated from
the mapping tables and conversion scripts in the
`GMT-octal-codes repository <https://github.com/seisman/GMT-octal-codes>`__. Refer to
that repository for details.

Some code points are undefined and are assigned with the replacement character
(``\ufffd``).
@@ -16,14 +16,17 @@
----------

- GMT-octal-codes: https://github.com/seisman/GMT-octal-codes
- GMT official documentation: https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html
- GMT documentation: https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html
- Adobe Postscript Language Reference: https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf
- ISOLatin1+: https://en.wikipedia.org/wiki/PostScript_Latin_1_Encoding
- Adobe ISOLatin1+: https://en.wikipedia.org/wiki/PostScript_Latin_1_Encoding
- Adobe Symbol: https://en.wikipedia.org/wiki/Symbol_(typeface)
- Zapf Dingbats: https://en.wikipedia.org/wiki/Zapf_Dingbats
- Adobe ZapfDingbats: https://en.wikipedia.org/wiki/Zapf_Dingbats
- Adobe Glyph List: https://github.com/adobe-type-tools/agl-aglfn
- ISO-8859: https://en.wikipedia.org/wiki/ISO/IEC_8859
"""

import codecs

# Dictionary of character mappings for different encodings.
charset: dict = {}

@@ -129,3 +132,12 @@
strict=False,
)
)

# ISO-8859-x charsets and x can be 1-11, 13-16.
for i in range(1, 17):
    if i == 12:  # ISO-8859-12 was abandoned.
        continue
    charset[f"ISO-8859-{i}"] = {
        code: codecs.decode(bytes([code]), f"iso8859_{i}", errors="replace")
        for code in [*range(0o040, 0o200), *range(0o240, 0o400)]
    }
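
Because the loop decodes with `errors="replace"`, code points that a part leaves unassigned end up as the replacement character (`\ufffd`) in the mapping. A small sketch of how to list them follows; the helper `undefined_codes` and the constant `CODES` are illustrative names, not part of the PR.

```python
import codecs

# Same code ranges as in the diff: printable ASCII plus the upper half.
CODES = [*range(0o040, 0o200), *range(0o240, 0o400)]


def undefined_codes(part: int) -> list:
    """Return the octal codes that have no character in ISO-8859-<part>."""
    return [
        f"\\{code:03o}"
        for code in CODES
        if codecs.decode(bytes([code]), f"iso8859_{part}", errors="replace")
        == "\ufffd"
    ]


print(undefined_codes(1))  # ISO-8859-1 assigns every code in these ranges
print(undefined_codes(3))  # ISO-8859-3 leaves a few codes unassigned
```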
1 change: 1 addition & 0 deletions pygmt/helpers/__init__.py
@@ -15,6 +15,7 @@
unique_name,
)
from pygmt.helpers.utils import (
_check_encoding,
_validate_data_input,
args_in_kwargs,
build_arg_list,
133 changes: 123 additions & 10 deletions pygmt/helpers/utils.py
@@ -115,6 +115,78 @@ def _validate_data_input(
raise GMTInvalidInput("data must provide x, y, and z columns.")


def _check_encoding(
    argstr: str,
) -> Literal[
    "ascii",
    "ISOLatin1+",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-11",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
]:
    """
    Check the charset encoding of a string.

    All characters in the string must be in the same charset encoding, otherwise the
    default ``ISOLatin1+`` encoding is returned. Characters in the Adobe Symbol and
    ZapfDingbats encodings are also checked because they're independent of the choice
    of encoding.

    Parameters
    ----------
    argstr
        The string to be checked.

    Returns
    -------
    encoding
        The encoding of the string.

    Examples
    --------
    >>> _check_encoding("123ABC+-?!")  # ASCII characters only
    'ascii'
    >>> _check_encoding("12AB±β①②")  # Characters in ISOLatin1+
    'ISOLatin1+'
    >>> _check_encoding("12ABāáâãäåβ①②")  # Characters in ISO-8859-4
    'ISO-8859-4'
    >>> _check_encoding("12ABŒā")  # Mix characters in ISOLatin1+ (Œ) and ISO-8859-4 (ā)
    'ISOLatin1+'
    >>> _check_encoding("123AB中文")  # Characters not in any charset encoding
    'ISOLatin1+'
    """
    # Return "ascii" if the string only contains ASCII characters.
    if all(32 <= ord(c) <= 126 for c in argstr):
        return "ascii"
    # Loop through all supported encodings and check if all characters in the string
    # are in the charset of the encoding. If all characters are in the charset, return
    # the encoding. The ISOLatin1+ encoding is checked first because it is the default
    # and most common encoding.
    adobe_chars = set(charset["Symbol"].values()) | set(
        charset["ZapfDingbats"].values()
    )
    for encoding in ["ISOLatin1+"] + [f"ISO-8859-{i}" for i in range(1, 17)]:
        if encoding == "ISO-8859-12":  # ISO-8859-12 was abandoned. Skip it.
            continue
        if all(c in (set(charset[encoding].values()) | adobe_chars) for c in argstr):
            return encoding  # type: ignore[return-value]
    # Return the "ISOLatin1+" encoding if the string contains characters from multiple
    # charset encodings or contains characters that are not in any charset encoding.
    return "ISOLatin1+"
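
The "first encoding that covers every character wins" idea can be tried without PyGMT. The sketch below is a simplified stand-in built on the standard-library codecs only: it knows nothing about the Adobe ISOLatin1+, Symbol or ZapfDingbats tables, and it returns `None` where `_check_encoding` would fall back to `ISOLatin1+`. The name `guess_encoding` is hypothetical.

```python
from typing import Optional

# ISO-8859-12 was abandoned, so it is left out.
PARTS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)


def guess_encoding(text: str) -> Optional[str]:
    """Return "ascii", the first ISO-8859-x part covering ``text``, or None."""
    if all(32 <= ord(c) <= 126 for c in text):
        return "ascii"
    for part in PARTS:
        try:
            # Encoding succeeds only if every character exists in this part.
            text.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue
        return f"ISO-8859-{part}"
    return None


print(guess_encoding("123ABC+-?!"))
print(guess_encoding("12ABāáâãäå"))
print(guess_encoding("123AB中文"))
```

Since parts are tried in ascending order, a string whose characters exist in several parts is assigned the lowest-numbered one, which mirrors the loop order in the diff.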


def data_kind(
data: Any = None, required: bool = True
) -> Literal["arg", "file", "geojson", "grid", "image", "matrix", "vectors"]:
@@ -192,17 +264,41 @@ def data_kind(
return kind


def non_ascii_to_octal(argstr: str) -> str:
def non_ascii_to_octal(
argstr: str,
encoding: Literal[
"ascii",
"ISOLatin1+",
"ISO-8859-1",
"ISO-8859-2",
"ISO-8859-3",
"ISO-8859-4",
"ISO-8859-5",
"ISO-8859-6",
"ISO-8859-7",
"ISO-8859-8",
"ISO-8859-9",
"ISO-8859-10",
"ISO-8859-11",
"ISO-8859-13",
"ISO-8859-14",
"ISO-8859-15",
"ISO-8859-16",
] = "ISOLatin1+",
) -> str:
r"""
Translate non-ASCII characters to their corresponding octal codes.

Currently, only characters in the ISOLatin1+ charset and Symbol/ZapfDingbats fonts
are supported.
Currently, only non-ASCII characters in the Adobe ISOLatin1+, Adobe Symbol, Adobe
ZapfDingbats, and ISO-8859-x (x can be 1-11, 13-16) encodings are supported.
The Adobe Standard encoding is not supported yet.

Parameters
----------
argstr
The string to be translated.
encoding
The encoding of characters in the string.

Returns
-------
@@ -219,9 +315,11 @@ def non_ascii_to_octal(argstr: str) -> str:
'@%34%\\041@%%@%34%\\176@%%@%34%\\241@%%@%34%\\376@%%'
>>> non_ascii_to_octal("ABC ±120° DEF α ♥")
'ABC \\261120\\260 DEF @~\\141@~ @%34%\\252@%%'
>>> non_ascii_to_octal("12ABāáâãäåβ①②", encoding="ISO-8859-4")
'12AB\\340\\341\\342\\343\\344\\345@~\\142@~@%34%\\254@%%@%34%\\255@%%'
""" # noqa: RUF002
# Return the string if it only contains printable ASCII characters from 32 to 126.
if all(32 <= ord(c) <= 126 for c in argstr):
# Return the input string if it only contains ASCII characters.
if encoding == "ascii" or all(32 <= ord(c) <= 126 for c in argstr):
return argstr

# Dictionary mapping non-ASCII characters to octal codes
@@ -232,15 +330,15 @@ def non_ascii_to_octal(argstr: str) -> str:
mapping.update(
{c: f"@%34%\\{i:03o}@%%" for i, c in charset["ZapfDingbats"].items()}
)
# Adobe ISOLatin1+ charset. Put at the end.
mapping.update({c: f"\\{i:03o}" for i, c in charset["ISOLatin1+"].items()})
# ISOLatin1+ or ISO-8859-x charset.
mapping.update({c: f"\\{i:03o}" for i, c in charset[encoding].items()})

# Remove any printable characters
mapping = {k: v for k, v in mapping.items() if k not in string.printable}
return argstr.translate(str.maketrans(mapping))
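
The translation step is a plain `str.maketrans`/`str.translate` lookup. A reduced sketch, assuming only the upper half of one ISO-8859-x part and none of the Symbol or ZapfDingbats font switches handled by `non_ascii_to_octal`, could look like this; `to_octal` is an illustrative name.

```python
def to_octal(text: str, part: int) -> str:
    """Replace non-ASCII characters by the octal codes of ISO-8859-<part>."""
    mapping = {}
    for code in range(0o240, 0o400):
        char = bytes([code]).decode(f"iso8859_{part}", errors="replace")
        if char != "\ufffd":  # skip code points that are undefined in this part
            mapping[char] = f"\\{code:03o}"
    # Characters without an entry in the mapping are passed through unchanged.
    return text.translate(str.maketrans(mapping))


print(to_octal("12ABāáâãäå", 4))
```

For the ISO-8859-4 letters this produces the same `\340` to `\345` codes as the new doctest above.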


def build_arg_list(
def build_arg_list( # noqa: PLR0912
kwdict: dict[str, Any],
confdict: dict[str, str] | None = None,
infile: str | pathlib.PurePath | Sequence[str | pathlib.PurePath] | None = None,
@@ -310,6 +408,10 @@ def build_arg_list(
... )
... )
['f1.txt', 'f2.txt', '-A0', '-B', '--FORMAT_DATE_MAP=o dd', '->out.txt']
>>> build_arg_list(dict(B="12ABāβ①②"))
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-4']
>>> build_arg_list(dict(B="12ABāβ①②"), confdict=dict(PS_CHAR_ENCODING="ISO-8859-5"))
['-B12AB\\340@~\\142@~@%34%\\254@%%@%34%\\255@%%', '--PS_CHAR_ENCODING=ISO-8859-5']
>>> print(build_arg_list(dict(R="1/2/3/4", J="X4i", watre=True)))
Traceback (most recent call last):
...
@@ -324,11 +426,22 @@ def build_arg_list(
elif value is True:
gmt_args.append(f"-{key}")
elif is_nonstr_iter(value):
gmt_args.extend(non_ascii_to_octal(f"-{key}{_value}") for _value in value)
gmt_args.extend(f"-{key}{_value}" for _value in value)
else:
gmt_args.append(non_ascii_to_octal(f"-{key}{value}"))
gmt_args.append(f"-{key}{value}")

# Convert non-ASCII characters (if any) in the arguments to octal codes
encoding = _check_encoding("".join(gmt_args))
if encoding != "ascii":
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
Comment on lines +433 to +436
Member

This block assumes that there will only be one encoding type? To support multiple encodings in different arguments, maybe change it to something like this (untested):

Suggested change
# Convert non-ASCII characters (if any) in the arguments to octal codes
encoding = _check_encoding("".join(gmt_args))
if encoding != "ascii":
    gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args]
# Convert non-ASCII characters (if any) in the arguments to octal codes
for i, arg in enumerate(gmt_args):
    encoding = _check_encoding(arg)
    if encoding != "ascii":
        gmt_args[i] = non_ascii_to_octal(arg, encoding=encoding)

But this can be done in a separate PR perhaps. I'm not sure how common it is to mix encodings in different arguments.

Member Author

> To support multiple encodings in different arguments, maybe change it to something like this (untested):

It's not possible in GMT. For each GMT module call, we can only pass --PS_CHAR_ENCODING once, so it means that all the arguments in a GMT call must be in the same encoding.

For Figure.text, the text string and any arguments also must use the same encoding.

Member Author

If people use mixed encodings in different arguments, the default ISOLatin1+ encoding is used.

gmt_args = sorted(gmt_args)

# Set --PS_CHAR_ENCODING=encoding if necessary
if encoding not in {"ascii", "ISOLatin1+"} and not (
confdict and "PS_CHAR_ENCODING" in confdict
):
gmt_args.append(f"--PS_CHAR_ENCODING={encoding}")

if confdict:
gmt_args.extend(f"--{key}={value}" for key, value in confdict.items())
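
The rule for when `--PS_CHAR_ENCODING` is appended can be isolated as a small pure function. This is only a sketch of the condition shown above; `ps_char_encoding_args` is a hypothetical name and is not a helper in the PR.

```python
from typing import Optional


def ps_char_encoding_args(encoding: str, confdict: Optional[dict] = None) -> list:
    """Return the extra ``--PS_CHAR_ENCODING`` argument, if one is needed."""
    if encoding in {"ascii", "ISOLatin1+"}:
        return []  # no override needed for ASCII or the default ISOLatin1+
    if confdict and "PS_CHAR_ENCODING" in confdict:
        return []  # an explicit user setting is passed through confdict instead
    return [f"--PS_CHAR_ENCODING={encoding}"]


print(ps_char_encoding_args("ISO-8859-4"))
print(ps_char_encoding_args("ISO-8859-4", {"PS_CHAR_ENCODING": "ISO-8859-5"}))
```

The second call returns an empty list, which matches the doctest in which a user-supplied `PS_CHAR_ENCODING="ISO-8859-5"` is the only encoding setting in the final argument list.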

31 changes: 21 additions & 10 deletions pygmt/src/text.py
@@ -6,6 +6,7 @@
from pygmt.clib import Session
from pygmt.exceptions import GMTInvalidInput
from pygmt.helpers import (
_check_encoding,
build_arg_list,
data_kind,
fmt_docstring,
@@ -59,13 +60,12 @@ def text_( # noqa: PLR0912
- ``x``/``y``, and ``text``
- ``position`` and ``text``

The text strings passed via the ``text`` parameter can contain ASCII
characters and non-ASCII characters defined in the ISOLatin1+ encoding
(i.e., IEC_8859-1), and the Symbol and ZapfDingbats character sets.
See :gmt-docs:`reference/octal-codes.html` for the full list of supported
non-ASCII characters.
The text strings passed via the ``text`` parameter can contain ASCII characters and
non-ASCII characters defined in the Adobe ISOLatin1+, Adobe Symbol, Adobe
ZapfDingbats and ISO-8859-x (x can be 1-11, 13-16) encodings. Refer to
:doc:`techref/encodings` for the full list of supported non-ASCII characters.

Full option list at :gmt-docs:`text.html`
Full option list at :gmt-docs:`text.html`.

{aliases}

@@ -226,13 +226,24 @@ def text_( # noqa: PLR0912
kwargs["t"] = ""

# Append text at last column. Text must be passed in as str type.
confdict = {}
if kind == "vectors":
extra_arrays.append(
np.vectorize(non_ascii_to_octal)(np.atleast_1d(text).astype(str))
)
text = np.atleast_1d(text).astype(str)
encoding = _check_encoding("".join(text))
if encoding != "ascii":
text = np.vectorize(non_ascii_to_octal, excluded="encoding")(
text, encoding=encoding
)
extra_arrays.append(text)

if encoding not in {"ascii", "ISOLatin1+"}:
confdict = {"PS_CHAR_ENCODING": encoding}

with Session() as lib:
with lib.virtualfile_in(
check_kind="vector", data=textfiles, x=x, y=y, extra_arrays=extra_arrays
) as vintbl:
lib.call_module(module="text", args=build_arg_list(kwargs, infile=vintbl))
lib.call_module(
module="text",
args=build_arg_list(kwargs, infile=vintbl, confdict=confdict),
)
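
Detecting the encoding on the joined text, rather than per string, matters because one GMT call carries a single `PS_CHAR_ENCODING`. The stdlib-only sketch below illustrates the difference; it ignores the Adobe tables, and `covers` and `first_part` are hypothetical helpers.

```python
from typing import Optional

PARTS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)


def covers(text: str, part: int) -> bool:
    """Check whether every character of ``text`` exists in ISO-8859-<part>."""
    try:
        text.encode(f"iso8859_{part}")
    except UnicodeEncodeError:
        return False
    return True


def first_part(text: str) -> Optional[int]:
    """Return the lowest ISO-8859 part number that covers ``text``, if any."""
    return next((p for p in PARTS if covers(text, p)), None)


texts = ["é", "ā"]
print([first_part(t) for t in texts])  # per-string detection can disagree
print(first_part("".join(texts)))      # joint detection gives one answer
```

Checked separately, the two strings resolve to different parts; checked together, they resolve to one part that contains both characters, so every string can be converted with the same table.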
5 changes: 5 additions & 0 deletions pygmt/tests/baseline/test_text_nonascii_iso8859.png.dvc
@@ -0,0 +1,5 @@
outs:
- md5: a0f35a1d58c95e6589c7397e7660e946
size: 17089
hash: md5
path: test_text_nonascii_iso8859.png
13 changes: 13 additions & 0 deletions pygmt/tests/test_text.py
@@ -434,3 +434,16 @@ def test_text_quotation_marks():
fig.basemap(projection="X4c/2c", region=[0, 4, 0, 2], frame=0)
fig.text(x=2, y=1, text='\\234 ‘ ’ " “ ”', font="20p") # noqa: RUF001
return fig


@pytest.mark.mpl_image_compare
def test_text_nonascii_iso8859():
    """
    Test passing text strings with non-ascii characters in ISO-8859-4 encoding.
    """
    fig = Figure()
    fig.basemap(region=[0, 10, 0, 10], projection="X10c", frame=["WSEN+tAāáâãäåB"])
    fig.text(position="TL", text="position-text:1ÉĘËĖ2")
    fig.text(x=1, y=1, text="xytext:1éęëė2")
    fig.text(x=[5, 5], y=[3, 5], text=["xytext1:ųúûüũūαζ∆❡", "xytext2:íîī∑π∇✉"])
    return fig