Support non-ASCII characters in PyGMT arguments and text in Figure.text #2204

seisman · 2022-11-23T03:05:35Z

Problems

Due to the limitation of the PostScript language, GMT can only work with ASCII characters and a small set of non-ASCII characters. See https://docs.generic-mapping-tools.org/latest/cookbook/octal-codes.html for the full list of characters that PostScript/GMT/PyGMT can accept.

These non-ASCII characters must be specified using their octal codes or character escape sequences. A few non-ASCII characters (e.g., ü, Î) can be typed directly, and GMT substitutes them with the correct PostScript octal codes.

Users who don't know the limitations may pass non-ASCII characters directly in the arguments. For example:

import pygmt
fig = pygmt.Figure()
fig.basemap(region=[0, 10, 0, 5], projection="x1c", frame="WSen+tTime (s) vs Distance (°)")
fig.show()

The above script produces this "surprising" figure:

So, to add a non-ASCII character to a plot, users must know about this limitation, go to https://docs.generic-mapping-tools.org/latest/cookbook/octal-codes.html, look for the character in the four tables, and work out the corresponding octal code (\260 for the symbol °). This is tedious and not easy.

After finding the octal code, users may think changing ° to \260 should work:

import pygmt
fig = pygmt.Figure()
fig.basemap(region=[0, 10, 0, 5], projection="x1c", frame="WSen+tTime (s) vs Distance (\260)")
fig.show()

but it still produces the same "surprising" figure, because the Python interpreter recognizes \260 first, and converts it to ° before passing it to the GMT API. So, users have to use double backslashes or raw strings:

frame="WSen+tTime (s) vs Distance (\\260)"

or

frame=r"WSen+tTime (s) vs Distance (\260)"
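
The difference between these spellings can be checked in plain Python, without PyGMT. A minimal sketch (the variable names are only for illustration):

```python
# In a normal string literal, Python interprets \260 as an octal escape,
# so the string contains the single character U+00B0 (°).
plain = "Distance (\260)"

# A doubled backslash or a raw string keeps the backslash and the three
# digits as four separate characters, which is what GMT needs to receive.
escaped = "Distance (\\260)"
raw = r"Distance (\260)"

print(plain == "Distance (°)")  # True
print(escaped == raw)           # True
print(len(plain), len(raw))     # 12 15
```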

Solutions

Since Python works well with non-ASCII characters (actually, it works with any Unicode character), it's possible to pass ° in Python, and PyGMT should substitute the non-ASCII characters with the corresponding octal codes.

Here are some tests in Python:

# Python supports non-ASCII characters
>>> "WSen+tTime (s) vs Distance (°)"
'WSen+tTime (s) vs Distance (°)'

# Python knows how to convert \260 to °
>>> "WSen+tTime (s) vs Distance (\260)"
'WSen+tTime (s) vs Distance (°)'

# replace ° with \\260
>>> "WSen+tTime (s) vs Distance (°)".replace("°", "\\260")
'WSen+tTime (s) vs Distance (\\260)'

# how to convert ° to \\260. It should work for other non-ASCII characters
>>> oct(ord("°")).replace("0o", '\\')
'\\260'

So, if we can do the substitutions/conversions internally, we can support non-ASCII characters better. The simplest solution is to define a big dictionary that maps non-ASCII characters (e.g., °) to octal codes (e.g., \260). Better and more clever solutions are also possible.
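
A minimal sketch of the dictionary idea, plus a dictionary-free variant based on the oct(ord(...)) test above. The names CHAR_TO_OCTAL, to_octal, and to_octal_latin1 are hypothetical and not part of PyGMT, and the sketch assumes that the octal code of a character in the range U+00A0 to U+00FF equals its code point:

```python
# Hypothetical, deliberately tiny mapping; a real table would cover the
# full character set listed in the GMT documentation.
CHAR_TO_OCTAL = {
    "°": "\\260",  # degree sign, U+00B0
    "ü": "\\374",  # u with diaeresis, U+00FC
    "Î": "\\316",  # I with circumflex, U+00CE
}


def to_octal(text: str) -> str:
    """Replace every mapped character in ``text`` with its octal code."""
    return text.translate(str.maketrans(CHAR_TO_OCTAL))


def to_octal_latin1(text: str) -> str:
    """Dictionary-free variant: derive the octal code from the code point.

    Assumption of this sketch: for U+00A0..U+00FF the octal code equals
    the code point. Other characters are left unchanged.
    """
    return "".join(
        f"\\{ord(ch):03o}" if 0xA0 <= ord(ch) <= 0xFF else ch for ch in text
    )


print(to_octal("WSen+tTime (s) vs Distance (°)"))
print(to_octal_latin1("WSen+tTime (s) vs Distance (°)"))
```

Both calls print WSen+tTime (s) vs Distance (\260). The dictionary variant is easier to audit character by character; the arithmetic variant needs no table but only covers one contiguous range.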

Notes about the possible limitations of the solutions

Non-ASCII characters can be used in many cases:

1. PyGMT arguments, e.g., frame="WSen+tTime (s) vs Distance (°)"
2. Text strings as input data, e.g., fig.text(x=0, y=0, text="Distance (°)")
3. Text strings in a plaintext file, e.g., a plaintext file with a record like 0 0 Distance (°)

The above solution should work well for case 1, may or may not work for case 2 (depending on the implementation), and likely won't work for case 3.

Are you willing to help implement and maintain this feature?

Yes, but more discussions are needed.


seisman · 2023-08-21T10:56:00Z

TODO list after PR #2584:

Check PS_CHAR_ENCODING so it also works with Standard+ character set?
Check arguments that are not processed by build_arg_string (e.g., text in Figure.text). See "Figure.text: Support non-ASCII characters in the 'text' parameter" #2638
Add a gallery example or a tutorial? See "Add a tutorial for typesetting non-ASCII characters" #3389

weiji14 · 2023-08-25T06:46:58Z

I was trying to get the character ā (Latin small letter a with macron) to plot using one of ISO-8859-4/ISO-8859-10/ISO-8859-13 in #2641 (comment), but it doesn't work when setting pygmt.config(PS_CHAR_ENCODING="ISO-8859-4"), because we need to use --PS_CHAR_ENCODING inline according to https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#term-PS_CHAR_ENCODING:

Note: Normally the character set is written as part of the PostScript header. If you need to switch to another character set for a later overlay then you must use --PS_CHAR_ENCODING=encoding on the command line and not via gmt gmtset.

Workaround was to use the composite character @!a\225 following https://docs.generic-mapping-tools.org/6.4/tutorial/session-2.html#plotting-text-strings. I'm not sure if it's worth adding --PS_CHAR_ENCODING as an option to plotting methods to make it easier. I think we discussed a while ago not to support double-dash -- inline options?

seisman · 2024-04-26T15:54:06Z

After PRs #2584, #2638, #3192, and #3199, PyGMT already provides basic support for non-ASCII characters.

In short, we're maintaining a big dictionary mapping non-ASCII characters to their octal codes. So users can pass a character like ɑ (alpha) and PyGMT will map it to @~\\141@~. There is no direct way to type ɑ using a keyboard (maybe there are shortcuts, but who can remember them all?), so users usually need to copy and paste from another source. However, many characters look similar. For example:

In [19]: "Ω" == "Ω"
Out[19]: False

In [20]: "Δ" == "∆"
Out[20]: False

In [21]: import unicodedata

In [22]: unicodedata.name("Ω")
Out[22]: 'OHM SIGN'

In [23]: unicodedata.name("Ω")
Out[23]: 'GREEK CAPITAL LETTER OMEGA'

In [24]: unicodedata.name("Δ")
Out[24]: 'GREEK CAPITAL LETTER DELTA'

In [25]: unicodedata.name("∆")
Out[25]: 'INCREMENT'

Since these characters are so similar, users may use the "incorrect" one and then get surprising results. Actually, we're using some incorrect characters in our mapping dictionary.
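
One way to spot such look-alikes before they reach a plot is to list the code point and Unicode name of every non-ASCII character in a string. A small sketch using only the standard library (the helper name describe_non_ascii is hypothetical):

```python
import unicodedata


def describe_non_ascii(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Escapes are used so the two look-alikes stay distinguishable in source code:
# \u2126 is OHM SIGN, \u03a9 is GREEK CAPITAL LETTER OMEGA.
for row in describe_non_ascii("R = 5 \u2126, \u03a9 = 2\u03c0f"):
    print(row)
```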

To solve the problem, we need character tables that users can copy. The official GMT documentation provides the tables (https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html) in PNG/PDF format but they're not easy to copy. Better tables are available at

However, these tables are "incomplete" (some characters are missing) compared to the GMT ones. For example, in the Symbol table, \322 is ©, but in https://www.compart.com/en/unicode/charsets/Adobe-Symbol-Encoding, it's mapped to Unicode character U+F6DA, which belongs to the "Private Use Area" block (I guess there are historical reasons behind this). So, instead of using the private, invisible U+F6DA, we should map © (U+00A9) to @~\\322@~. It also means we need to maintain our own character tables.

This repository https://github.com/seisman/GMT-octal-codes maintains the mapping files that can map Unicode characters to GMT octal codes. Check the README files in that repository for how the mapping files are created.

With the well-maintained mapping files, we can refactor the mapping dictionary in the PyGMT project and add character tables for the supported encodings, as done in #3206.
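
The mapping described in this comment can be pictured as one table per font, with Symbol-font characters wrapped in @~ ... @~ as in the @~\\141@~ example. The following is only a sketch under that assumption: the table and function names are hypothetical, the tables hold a handful of entries, and this is not the code from #3206.

```python
# Hypothetical, tiny tables for illustration only.
ISOLATIN1_TABLE = {"\u00b0": "\\260"}  # ° DEGREE SIGN
SYMBOL_TABLE = {
    "\u03b1": "\\141",  # α GREEK SMALL LETTER ALPHA
    "\u03a9": "\\127",  # Ω GREEK CAPITAL LETTER OMEGA (not U+2126 OHM SIGN)
    "\u0394": "\\104",  # Δ GREEK CAPITAL LETTER DELTA (not U+2206 INCREMENT)
}


def convert(text: str) -> str:
    """Replace mapped characters, switching to the Symbol font where needed."""
    out = []
    for ch in text:
        if ch in SYMBOL_TABLE:
            # @~ toggles the Symbol font on and off in GMT text strings.
            out.append(f"@~{SYMBOL_TABLE[ch]}@~")
        else:
            # Fall back to the default-encoding table, else keep the character.
            out.append(ISOLATIN1_TABLE.get(ch, ch))
    return "".join(out)


print(convert("\u03b1 = 30\u00b0"))  # @~\141@~ = 30\260
```

Because the tables are keyed by exact code points, a look-alike such as U+2126 (OHM SIGN) passes through unconverted, which is why copyable character tables matter.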

seisman · 2024-10-28T02:52:48Z

@GenericMappingTools/pygmt-maintainers

After a series of PRs, I think we already provide good support for non-ASCII characters.

Things to note are:

We're assuming that users are using the ISOLatin1+ encoding by default
The "Standard" or "Standard+" encodings are not supported

Supporting "Standard"/"Standard+" or allowing any default encoding can be tricky because we need to inquire about the current character encoding using lib.get_default["PS_CHAR_ENCODING"] in some places (e.g., in the non_ascii_to_octal function).

We have some options:

1. Do nothing. Then we can close this issue.
2. In pygmt.begin(), check the current PS_CHAR_ENCODING. Raise a warning if PS_CHAR_ENCODING!=ISOLatin1+ and set PS_CHAR_ENCODING=ISOLatin1+. This assumes that users don't change PS_CHAR_ENCODING in the middle of a script.
3. Try hard to support "Standard"/"Standard+" encodings, which likely means we need to query the current character encoding in non_ascii_to_octal/_check_encoding, or we have to keep track of the changes of character encoding in a global variable.

I prefer option 2 and will work on it if there are no objections or comments in one week.
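
The logic of option 2 can be sketched as a pure function. This is a hypothetical illustration, not PyGMT code: enforce_encoding is an invented name, and its argument stands in for the PS_CHAR_ENCODING value that would be read when the session starts.

```python
import warnings

EXPECTED_ENCODING = "ISOLatin1+"


def enforce_encoding(current: str) -> str:
    """Return the encoding to use, warning if ``current`` is not the expected one.

    ``current`` stands in for the PS_CHAR_ENCODING value read at session start.
    """
    if current != EXPECTED_ENCODING:
        warnings.warn(
            f"PS_CHAR_ENCODING={current!r} is not supported; "
            f"resetting it to {EXPECTED_ENCODING!r}.",
            stacklevel=2,
        )
    return EXPECTED_ENCODING


print(enforce_encoding("ISOLatin1+"))  # ISOLatin1+
```

As noted in option 2, a check like this only runs once, so it cannot catch a later change of PS_CHAR_ENCODING in the middle of a script.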

seisman · 2024-11-12T15:54:19Z

Actually, there is an option 4: adding --PS_CHAR_ENCODING=<encoding> even if the text string encoding is ISOLatin1+. I have implemented this option in #3611. Please see that PR for details.
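
To illustrate the idea behind option 4 (not the implementation in #3611), the sketch below converts a string to octal codes for a given charset and builds the matching inline option. The function name is hypothetical, and the sketch assumes that Python's ISO-8859-x codecs agree with GMT's tables for the characters involved.

```python
def encode_with_option(text: str, codec: str, gmt_name: str) -> tuple[str, str]:
    """Convert ``text`` to octal codes for ``codec`` and build the inline option.

    ``codec`` is a Python codec name (e.g. "iso8859_4") and ``gmt_name`` the
    matching GMT encoding name (e.g. "ISO-8859-4"). Pairing them by hand is a
    simplification of this sketch. Raises UnicodeEncodeError if a character
    is not part of the charset.
    """
    converted = "".join(
        ch if ch.isascii() else f"\\{ch.encode(codec)[0]:03o}" for ch in text
    )
    return converted, f"--PS_CHAR_ENCODING={gmt_name}"


# ā (U+0101) is byte 0xE0, i.e. octal 340, in ISO-8859-4.
print(encode_with_option("\u0101", "iso8859_4", "ISO-8859-4"))
```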

seisman added the feature request New feature wanted label Nov 23, 2022

seisman mentioned this issue Nov 24, 2022

Add a tutorial explaining the usage of octal codes #2000

Closed

seisman mentioned this issue Jun 24, 2023

Support non-ASCII characters in function arguments #2584

Merged

14 tasks

seisman closed this as completed in #2584 Aug 21, 2023

seisman added this to the 0.10.0 milestone Aug 21, 2023

seisman reopened this Aug 21, 2023

seisman mentioned this issue Aug 21, 2023

Figure.text: Support non-ASCII characters in the 'text' parameter #2638

Merged

7 tasks

seisman modified the milestones: 0.10.0, 0.11.0 Sep 1, 2023

seisman added documentation Improvements or additions to documentation help wanted Helping hands are appreciated labels Nov 3, 2023

seisman modified the milestones: 0.11.0, 0.12.0 Dec 19, 2023

seisman removed this from the 0.12.0 milestone Feb 26, 2024

seisman changed the title ~~Better support of non-ASCII characters in PyGMT arguments~~ Support non-ASCII characters in PyGMT arguments and text in Figure.text Apr 26, 2024

seisman mentioned this issue Jun 19, 2024

Refactor to improve the user experience with non-ASCII characters #3206

Merged

seisman mentioned this issue Jul 4, 2024

Support non-ASCII characters in ISO-8859-x charsets #3310

Merged

seisman mentioned this issue Aug 11, 2024

Add a tutorial for typesetting non-ASCII characters #3389

Merged

seisman added this to the 0.14.0 milestone Oct 28, 2024

seisman removed the help wanted Helping hands are appreciated label Oct 28, 2024

seisman removed the documentation Improvements or additions to documentation label Oct 28, 2024

seisman linked a pull request Nov 12, 2024 that will close this issue

Ensure non-ASCII characters are typeset correctly even if PS_CHAR_ENCODING is not 'ISOLatin1+' #3611

Open
