-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support non-ASCII characters in ISO-8859-x charsets #3310
Conversation
Summary of changed imagesThis is an auto-generated report of images that have changed on the DVC remote
Image diff(s)Report last updated at commit 3474bb2 |
Co-authored-by: Michael Grund <[email protected]>
Co-authored-by: Yvonne Fröhlich <[email protected]>
pygmt/helpers/utils.py
Outdated
return encoding | ||
# Return the "ISOLatin1+" encoding if the string contains characters from multiple | ||
# charset encodings or contains characters that are not in any charset encoding. | ||
return "ISOLatin1+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns ISOLatin1+
/ISO-8859-1
/.../ISO-8859-16
, but Python's standard encodings are latin_1
/iso8859_1
/...iso8859_16
(https://docs.python.org/3/library/codecs.html#standard-encodings). They're inconsistent, but using names like ISOLatin1+
can greatly simplify the codes.
pygmt/helpers/utils.py
Outdated
'ISOLatin1+' | ||
>>> check_encoding("123AB中文") # Characters not in any charset encoding | ||
'ISOLatin1+' | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the GMT-supported encodings, I'm thinking if we should return ascii
(the name ascii
comes from https://docs.python.org/3/library/codecs.html#standard-encodings) if the input string only contains ASCII characters, e.g.,
if all(32 <= ord(c) <= 126 for c in argstr):
return "ascii"
Since in most cases, the arguments and the text string contain ASCII characters only. When ascii
encoding is detected, we no longer need to apply non_ascii_to_octal
to the strings, which may improve the performance for most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 2d01f6b.
CodSpeed Performance ReportMerging #3310 will not alter performanceComparing Summary
|
Ping @GenericMappingTools/pygmt-maintainers for reviews. |
I'll merge this PR in 24 hours if there are no further comments. |
pygmt/helpers/utils.py
Outdated
@@ -192,7 +192,58 @@ def data_kind( | |||
return kind | |||
|
|||
|
|||
def non_ascii_to_octal(argstr: str) -> str: | |||
def check_encoding(argstr: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too much to have a typehint like this?
def check_encoding(argstr: str) -> str: | |
def check_encoding(argstr: str) -> Literal[ | |
"ascii", | |
"ISOLatin1+", | |
"ISO-8859-1", | |
"ISO-8859-2", | |
"ISO-8859-3", | |
"ISO-8859-4", | |
"ISO-8859-5", | |
"ISO-8859-6", | |
"ISO-8859-7", | |
"ISO-8859-8", | |
"ISO-8859-9", | |
"ISO-8859-10", | |
"ISO-8859-11", | |
"ISO-8859-13", | |
"ISO-8859-14", | |
"ISO-8859-15", | |
"ISO-8859-16", | |
"ISO-8859-17", | |
]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about that too. Actually the same types can be used in typing hints the encoding
parameter of the non_ascii_to_octal
function. So my initial plan is to define a generic type
type Encodings = Literal[
"ascii",
"ISOLatin1+",
"ISO-8859-1",
"ISO-8859-2",
"ISO-8859-3",
"ISO-8859-4",
"ISO-8859-5",
"ISO-8859-6",
"ISO-8859-7",
"ISO-8859-8",
"ISO-8859-9",
"ISO-8859-10",
"ISO-8859-11",
"ISO-8859-13",
"ISO-8859-14",
"ISO-8859-15",
"ISO-8859-16",
]
and then use it like:
def check_encoding(argstr: str) -> Encodings:
def non_ascii_to_octal(argstr: str, encoding: Encodings = "ISOLatin1+") -> str:
but mypy v1.10.0 complains
pygmt/helpers/utils.py:21: error: PEP 695 type aliases are not yet supported [valid-type]
mypy v1.11.0 started to support PEP 695 but it was released just a few days ago (https://mypy-lang.blogspot.com/2024/07/mypy-111-released.html) and the PEP 695 support is still experimental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 7607c7e.
pygmt/helpers/__init__.py
Outdated
@@ -19,6 +19,7 @@ | |||
args_in_kwargs, | |||
build_arg_list, | |||
build_arg_string, | |||
check_encoding, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on using _check_encoding
to keep this function more private? I know we don't document pygmt.helpers.utils
in the API docs, but want to avoid users from thinking that this function is somewhat public if there's no leading underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 6728856.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pygmt/helpers/utils.py
Outdated
@@ -192,17 +192,113 @@ def data_kind( | |||
return kind | |||
|
|||
|
|||
def non_ascii_to_octal(argstr: str) -> str: | |||
def _check_encoding( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this private function near the top (together with _validate_data_input
)?
# Convert non-ASCII characters (if any) in the arguments to octal codes | ||
encoding = _check_encoding("".join(gmt_args)) | ||
if encoding != "ascii": | ||
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block assumes that there will only be one encoding type? To support multiple encodings in different arguments, maybe change it to something like this (untested):
# Convert non-ASCII characters (if any) in the arguments to octal codes | |
encoding = _check_encoding("".join(gmt_args)) | |
if encoding != "ascii": | |
gmt_args = [non_ascii_to_octal(arg, encoding=encoding) for arg in gmt_args] | |
# Convert non-ASCII characters (if any) in the arguments to octal codes | |
for i, arg in enumerate(gmt_args): | |
encoding = _check_encoding("".join(gmt_args)) | |
if encoding != "ascii": | |
gmt_args[i] = non_ascii_to_octal(arg, encoding=encoding) |
But this can be done in a separate PR perhaps. I'm not sure how common it is to mix encodings in different arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To support multiple encodings in different arguments, maybe change it to something like this (untested):
It's not possible in GMT. For each GMT module call, we can only pass --PS_CHAR_ENCODING
once, so it means that all the arguments in a GMT call must be in the same encoding.
For Figure.text
, the text
string and any arguments also must use the same encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If people use mixed encodings in different arguments, the default ISOLatin1+
encoding is used.
This PR enhances PyGMT to support non-ASCII characters in ISO-8859-x charsets (Refer to https://en.wikipedia.org/wiki/ISO/IEC_8859-1, https://en.wikipedia.org/wiki/ISO/IEC_8859-2 and more for available characters in each encoding). Related to #2204.
In GMT, the default encoding is
ISOLatin1+
(orStandard+
but it's not supported in PyGMT yet). To use non-ASCII characters in ISO-8859-x, we need to use--PS_CHAR_ENCODING=<encoding>
according to https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING. Please note that we can only use one encoding in a single GMT module call, which means mixing characters from different encodings will cause trouble. So, the implementation is simple. We just need to check the argument string orFigure.text
's text string and determine the charset encoding to use, and then add--PS_CHAR_ENCODING=<encoding>
when calling modules.This PR does a few things:
pygmt/encodings.py
check_encoding
to check the encoding of a string.encoding
parameter tonon_ascii_to_otcal
so that we know the encoding to use when converting non-ascii.Figure.text
'stext
parameter support ISO-8859-x charsets.Preview : https://pygmt-dev--3310.org.readthedocs.build/en/3310/techref/encodings.html
Here is an example showing how it works with non-ASCII characters in ISO-8859-4. The same script is added as a test.
The output is
The script below shows all the characters in different ISO-8859-x encodings. I don't check them all carefully, but most look correct. The exceptions are ISO-8859-6, ISO-8859-8, and ISO-8859-11. They don't work with standard fonts (https://docs.generic-mapping-tools.org/dev/gmt.conf.html#term-PS_CHAR_ENCODING) but I think it's OK because whoever wants to use these characters should have the fonts installed. Some characters like
Ṁ
in ISO-8859-14 are missing in the figure below and they also can't be shown correctly in GMT CLI. Looking at the charset definition files in the GMT source codes, the characterṀ
has a PS name like/uni1E40
(https://github.com/GenericMappingTools/gmt/blob/0584d620d27de5ed93a90e4b0dc56a2edb2d568a/src/PSL_ISO-8859-14.h#L24C43-L24C43), but it doesn't seem like a valid PS name. So, it must be upstream issues if any.