Display of Indic text (esp. Hindi in Devanagari script) #453

Open
StevanWhite opened this issue Sep 27, 2017 · 25 comments

Comments

@StevanWhite

StevanWhite commented Sep 27, 2017

In the display of Hindi text (in Devanagari script), I'm encouraged to see that much of the complex reordering of letters (Indic font shaping) is being carried out.

There are a couple of ugly bugs however. They both seem to occur randomly in the text. The effect is independent of the font being used for display (provided a font that provides Devanagari is installed!) I have been unable to identify anything in the text itself that consistently triggers the problem -- the same word will look fine here, but show trash there.

This is not a matter of simple encoding -- otherwise no Hindi would appear at all. It appears to be some glitch in the software.
udhr_hin.txt
consolez-udhr-hindi

A lot of people are looking for a good replacement for the system consoles that displays Indic text correctly. And there aren't many options out there.

Expected Behavior

Nice clean Hindi text to be displayed

Actual Behavior

Most of the text looks pretty good. But there are glitches throughout. I see two kinds:

  1. a sequence of two or three U+FFFD ("replacement character") glyphs appears, sometimes between words. I can only imagine this is some sort of bug.

  2. a U+25CC ("dotted circle") appears, often within a word. Sometimes it appears to replace a letter... a letter which is displayed just fine in the next word.
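
To tell the two glitches apart in captured output, one can count the code points directly. This is a minimal sketch, not part of ConsoleZ: `count_glitches` is a hypothetical helper and the sample string is made up. Note that a shaping engine usually draws the dotted circle for an orphaned mark without storing U+25CC in the text, so a count of zero alongside visible circles would point at broken clusters rather than bad characters.

```python
REPLACEMENT = "\ufffd"    # U+FFFD REPLACEMENT CHARACTER
DOTTED_CIRCLE = "\u25cc"  # U+25CC DOTTED CIRCLE

def count_glitches(text):
    """Return how often each of the two glitch code points occurs in text."""
    return {
        "U+FFFD": text.count(REPLACEMENT),
        "U+25CC": text.count(DOTTED_CIRCLE),
    }

# Made-up sample: two replacement characters followed by a Devanagari word.
sample = "\ufffd\ufffd\u0915\u093e"
print(count_glitches(sample))
```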

The attached image is a screen shot of the end of the attached text file

This word appears several times under the heading अनुच्छेद ३०
किसी
In its second occurrence, there is a dotted circle before the i-glyph,
but the i-glyph is in the right place.

Under अनुच्छेद २९.
in part (१) the word
का
gets a couple of unknown characters before it. Don't see anything else broken.
In part (२) the word
प्रजातन्त्रात्मक
is broken: replaced by unknown characters, looks like the letter त is lost.

Steps to reproduce

  1. What steps consistently cause this behavior?

Make sure some fonts are installed that support Devanagari, copy the attached file udhr_hin.txt to the main user directory, open ConsoleZ.
chcp 65001
type udhr_hin.txt

  2. What does someone need to do to recreate the issue?
    -- as above

Diagnostic Report

When reporting a bug you must provide a diagnostic report.
If you are not able to create a diagnostic report, explain why.
Privacy is not a valid explanation! The report is human readable and private data can be masked.

ConsoleZ aero amd64 1.18.1.17087
Windows 10 (64 bits) [10.0.14393]
is elevated? no
UAC prefix "Administrator: "
Tab (active): ConsoleZ
  View (active): 1. ConsoleZ
  is elevated? no
  input code page 65001
  output code page 65001
  Windows console font
font index: 0
face name: Consolas
font familly: 54 MODERN vector true type
font weight: 400
font size: width=8 height=16
max window size: cols=85 rows=56
  ConsoleZ font
GetFontLanguageInfo returns 262144.
The font contains extra glyphs not normally accessible using the code page.

Monitors 1
+ Flags 1  primary
  DeviceID \\?\DISPLAY#Default_Monitor#4&2abfaa30&0&UID0#{e6f07b5f-ee97-4a90-b076-33f57bf4eaa7}
  DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Class\{4d36e96e-e325-11ce-bfc1-08002be10318}\0001
  DeviceName \\.\DISPLAY1\Monitor0
  DeviceString Generic Non-PnP Monitor
  StateFlags 3
  Rect (0,0)x(1280,960)
  Work (0,0)x(1280,920)
  DPI (per monitor: yes) X=96 Y=96
System dpi 96
System metrics
  SM_CXSMICON        16
  SM_CYSMICON        16
  SM_CXICON          32
  SM_CYICON          32
  SM_CXVIRTUALSCREEN 1280
  SM_CYVIRTUALSCREEN 960
  SM_CYVIRTUALSCREEN 960
Settings file C:\Users\stevan white\AppData\Roaming\Console\console.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<settings lang="auto" icon_size="0">
	<console shell="" init_dir="" refresh="100" change_refresh="10" rows="25" columns="80" buffer_rows="500" buffer_columns="0" start_hidden="0" save_size="0">
		<colors background_text_opacity="255">
			<color id="0" r="0" g="0" b="0"/>
			<color id="1" r="0" g="0" b="128"/>
			<color id="2" r="0" g="150" b="0"/>
			<color id="3" r="0" g="150" b="150"/>
			<color id="4" r="170" g="25" b="25"/>
			<color id="5" r="128" g="0" b="128"/>
			<color id="6" r="128" g="128" b="0"/>
			<color id="7" r="192" g="192" b="192"/>
			<color id="8" r="128" g="128" b="128"/>
			<color id="9" r="0" g="100" b="255"/>
			<color id="10" r="0" g="255" b="0"/>
			<color id="11" r="0" g="255" b="255"/>
			<color id="12" r="255" g="50" b="50"/>
			<color id="13" r="255" g="0" b="255"/>
			<color id="14" r="255" g="255" b="0"/>
			<color id="15" r="255" g="255" b="255"/>
		</colors>
		<cursor style="0" r="255" g="255" b="255"/>
		<background type="0" r="0" g="0" b="0">
			<image file="" relative="0" extend="0" position="0">
				<tint opacity="0" r="0" g="0" b="0"/></image></background></console>
	<appearance>
		<font name="FreeMono" size="14" extra_width="0" bold="0" italic="0" smoothing="0" bold_intensified="0" italic_intensified="0" ligature="0">
			<color use="0" r="0" g="0" b="0"/></font>
		<window title="ConsoleZ" icon="" main_title_format="?U([%u] )?N({%u} )%m?s( - %s)" tab_title_format="%n. ?U([%u] )?N({%u} )%t?s( - %s)" use_tab_icon="1" use_tab_title="1" trim_tab_titles="20" trim_tab_titles_right="0"/>
		<fullscreen start_in_fullscreen="0" fullscreen_monitor="0">
			<controls show_menu="1" show_toolbar="1" show_searchbar="1" show_statusbar="1" show_tabs="1" hide_single_tab="1" tabs_on_bottom="0" hide_tab_icons="0" hide_tab_close_button="0" show_scrollbars="1"/>
			<transparency type="0" active_alpha="255" inactive_alpha="255" r="0" g="0" b="0"/></fullscreen>
		<controls show_menu="1" show_toolbar="1" show_searchbar="1" show_statusbar="1" show_tabs="1" hide_single_tab="1" tabs_on_bottom="0" hide_tab_icons="0" hide_tab_close_button="0" show_scrollbars="1"/>
		<styles caption="1" resizable="1" taskbar_button="1" border="1" inside_border="2" split_bar_size="0" tray_icon="1" quake_like="0" quake_animation_time="300" jumplist="0" integrated_ime="0" hide_when_inactive="0" per_monitor_dpi="0" keep_view_theme="0">
			<selection_color r="255" g="255" b="255"/>
			<highlight_color r="191" g="191" b="191"/></styles>
		<position x="-1" y="-1" save_position="0" w="-1" h="-1" save_size="0" z_order="0" dock="-1" snap="-1" save_state="0" state="-1"/>
		<transparency type="0" active_alpha="255" inactive_alpha="255" r="0" g="0" b="0"/></appearance>
	<behavior>
		<copy_paste copy_on_select="0" clear_on_copy="1" sensitive_copy="1" no_wrap="1" eol_spaces="1" trim_spaces="1" rtf="1" copy_newline_char="0" include_left_delimiter="0" include_right_delimiter="0" left_delimiters=" ([" right_delimiters=" )]"/>
		<scroll page_scroll_rows="0"/>
		<tab_highlight flashes="3" stay_highligted="1"/>
		<environment inherit="1" sync="1"/>
		<focus follow_mouse="0"/>
		<instance allow_multi="1"/>
		<clone use_current_dir="0"/>
		<search match_case="0" match_whole_word="0"/>
		<runas use_credprov="1"/>
		<close allow_closing_last_view="0" exit_on_closing_of_last_tab="1" confirm_closing_multiple_views="1" save_workspace_on_exit="0"/></behavior>
	<hotkeys use_scroll_lock="1">
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="83" command="settings"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="112" command="help"/>
		<hotkey ctrl="0" shift="0" alt="1" extended="0" code="115" command="exit"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="112" command="newtab1"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="113" command="newtab2"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="114" command="newtab3"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="115" command="newtab4"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="116" command="newtab5"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="117" command="newtab6"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="118" command="newtab7"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="119" command="newtab8"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="120" command="newtab9"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="121" command="newtab10"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="122" command="newtab11"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="123" command="newtab12"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="49" command="switchtab1"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="50" command="switchtab2"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="51" command="switchtab3"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="52" command="switchtab4"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="53" command="switchtab5"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="54" command="switchtab6"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="55" command="switchtab7"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="56" command="switchtab8"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="57" command="switchtab9"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="48" command="switchtab10"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="switchtab11"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="switchtab12"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="9" command="nexttab"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="9" command="prevtab"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="movetableft"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="movetabright"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="87" command="closetab"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="82" command="renametab"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="1" code="34" command="nextview"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="1" code="33" command="prevview"/>
		<hotkey ctrl="0" shift="0" alt="1" extended="1" code="37" command="leftview"/>
		<hotkey ctrl="0" shift="0" alt="1" extended="1" code="39" command="rightview"/>
		<hotkey ctrl="0" shift="0" alt="1" extended="1" code="38" command="topview"/>
		<hotkey ctrl="0" shift="0" alt="1" extended="1" code="40" command="bottomview"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="dechsize"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="inchsize"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="decvsize"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="incvsize"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="87" command="closeview"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="detachview"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="attach"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="79" command="splithoriz"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="69" command="splitvert"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="splitswap"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="clone"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="122" command="fullscreen"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="96" command="zoom100"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="107" command="zoominc"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="109" command="zoomdec"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="71" command="groupall"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="71" command="ungroupall"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="84" command="grouptab"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="84" command="ungrouptab"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="clear"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="1" code="45" command="copy"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="0" code="65" command="selectall"/>
		<hotkey ctrl="1" shift="0" alt="0" extended="1" code="46" command="clear_selection"/>
		<hotkey ctrl="0" shift="1" alt="0" extended="1" code="45" command="paste"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="paste_selection"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="stopscroll"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="resumescroll"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselleft"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselright"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textseltop"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselbottom"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselhome"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselend"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselpageup"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="textselpagedown"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselleft"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselright"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colseltop"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselbottom"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselhome"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselend"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselpageup"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="colselpagedown"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollrowup"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollrowdown"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollpageup"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollpagedown"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollcolleft"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollcolright"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollpageleft"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="scrollpageright"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="find"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="findnext"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="findprev"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="switchtransparency"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="alwaysontop"/>
		<hotkey ctrl="1" shift="1" alt="0" extended="0" code="112" command="dumpbuffer"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="cmdMenu1"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="cmdMenu2"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="cmdMenu3"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="cmdSnippets"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="ctrlC"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="wspload"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="wspsave"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="fwdmouse"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd1"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd2"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd3"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd4"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd5"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd6"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd7"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd8"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd9"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="externalcmd10"/>
		<hotkey ctrl="0" shift="0" alt="0" extended="0" code="0" command="activate" win="0"/>
	</hotkeys>
	<external_commands>
	</external_commands>
	<mouse>
		<actions>
			<action ctrl="0" shift="0" alt="0" button="1" name="copy"/>
			<action ctrl="0" shift="1" alt="0" button="1" name="select"/>
			<action ctrl="0" shift="0" alt="0" button="0" name="select_column"/>
			<action ctrl="0" shift="0" alt="0" button="3" name="paste"/>
			<action ctrl="1" shift="0" alt="0" button="1" name="drag"/>
			<action ctrl="0" shift="0" alt="0" button="4" name="link"/>
			<action ctrl="0" shift="0" alt="0" button="2" name="menu"/>
			<action ctrl="0" shift="0" alt="0" button="0" name="menu2"/>
			<action ctrl="0" shift="0" alt="0" button="0" name="menu3"/>
			<action ctrl="0" shift="0" alt="0" button="0" name="snippets"/>
		</actions></mouse>
	<snippets dir=""/>
	<tabs>
		<tab title="ConsoleZ" use_default_icon="0">
			<console shell="" init_dir="" priority="2" run_as_user="0" user="" net_only="0" run_as_admin="0"/>
		</tab>
	</tabs></settings>
@StevanWhite
Author

I tried it also with the beta version ConsoleZ.x64.1.18.2.17256, with similar results.

@StevanWhite
Author

I selected the text in the window and copy-pasted it into Notepad++.
Attached is the resulting file. The unknown-character and dotted-circle glyphs are evident.

udhr_hin.out.txt

@StevanWhite
Author

Hey. Is the text being chopped up into buffers before being fed to the underlying rendering layer?
That would have an effect like what we're seeing. If the text has to be buffered, it must be broken at white space!
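
The effect of chopping UTF-8 bytes at arbitrary buffer boundaries can be reproduced in a few lines. This is only an illustration of the suspected mechanism; whether cmd.exe, `type`, or any other layer actually does this is not established here. The function names and the 4-byte chunk size are arbitrary choices for the sketch.

```python
import codecs

def decode_naive(data, chunk_size):
    """Decode each fixed-size byte chunk on its own (the buggy approach)."""
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(chunk.decode("utf-8", errors="replace"))
    return "".join(parts)

def decode_incremental(data, chunk_size):
    """Decode the same chunks, carrying partial sequences across boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(decoder.decode(chunk))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

word = "\u0915\u093e"       # Devanagari "ka" + vowel sign, 3 UTF-8 bytes each
raw = word.encode("utf-8")

print(repr(decode_naive(raw, 4)))          # contains U+FFFD
print(decode_incremental(raw, 4) == word)
```

Breaking at a character boundary is enough to avoid U+FFFD; avoiding the dotted circle additionally requires not separating a mark from its base letter.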

@cbucher
Owner

cbucher commented Sep 28, 2017

ConsoleZ displays chars from console buffer.
Generally, a strange character is present in the font or in the console buffer.

Here, it seems cmd.exe and 'type' command have a curious behaviour.
With PowerShell and 'Get-Content udhr_hin.txt', the result seems correct.
Can you confirm?

@StevanWhite
Author

StevanWhite commented Sep 28, 2017 via email

@StevanWhite
Author

My colleague gets perfect display from the DOS 'type' command.
I have started from the beginning, experimenting with 'chcp'. It doesn't matter in PowerShell -- only the -Encoding switch makes a difference.

We are both using Windows 10, but mine is a German system, and hers -- It's English, near as I can tell.

Can you recommend a command that would list relevant environmental settings?

@StevanWhite
Author

The mail or something removed the image of my console, showing the current state on my machine.
I said there are dotted circles on the right -- I meant the other right -- the left side.
consolez-udhr-hindi-ps

I hope you have some idea how we can debug our settings -- I think it would be a good FAQ topic.

@cbucher
Owner

cbucher commented Sep 29, 2017

dotted circles appear when you cut a word (line return)

image

My colleague gets perfect display from the DOS 'type' command.

So you have more knowledge than me to resolve this issue.
Your colleague use ConsoleZ or not?

Check the Windows Console font used by your colleague

  • with ConsoleZ : View/ConsoleWindow to show Windows console. Right click on Windows console caption --> Properties --> Font tab
  • without ConsoleZ : Right click on Windows console caption --> Properties --> Font tab

What I can see by googling:

  • you cannot type Hindi without Unicode
  • Hindi has no ANSI codepage
  • there is no monospace font

@StevanWhite
Author

StevanWhite commented Sep 30, 2017

Hi Christophe,

I am working on this, and I have more to report, but a couple of things aren't clear to me. I'll write about that separately -- here I'll just answer your remarks and questions.

" dotted circles appear when you cut a word (line return) "

Not exactly. They happen if you cut a word before a mark character which must be applied to the character that precedes it. And this is the root of the remaining problem: some software layer is incorrectly breaking up the text before the font rendering layer gets it. What is unclear to me, is which software layer is responsible for the problem.
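
The rule described above (never cut in front of a mark that belongs to the preceding letter) can be approximated with the Unicode general category. This is a simplified sketch, not a full grapheme-cluster algorithm as specified in Unicode's text segmentation rules: `safe_cut` is a hypothetical helper, it treats only the Devanagari virama as a joiner, and it ignores ZWJ/ZWNJ.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, joins consonants into conjuncts

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def safe_cut(text, limit):
    """Largest index <= limit where text can be cut without orphaning a mark."""
    i = min(limit, len(text))
    # Step back while the cut would land before a mark or inside a conjunct.
    while 0 < i < len(text) and (is_mark(text[i]) or text[i - 1] == VIRAMA):
        i -= 1
    return i

kisi = "\u0915\u093f\u0938\u0940"   # the word from the report, 4 code points
print(safe_cut(kisi, 3))            # 2: cut before the second consonant
```

A result of 0 means the whole cluster has to move to the next line.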

" Your colleague use ConsoleZ or not? "

We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly.
When she uses ConsoleZ, she gets the same results that I have seen.

" What I can see by googling: "

" you cannot type Hindi without Unicode "
Unicode was invented as a means of supporting all writing systems. The Devanagari writing system used by Hindi is one of those.

" Hindi has no ANSI codepage "
Not by itself. The correct codepage to use for Hindi and other Indic languages is 65001, UTF-8.

" there is no monospace font "
As you can see in my images, we do have a monospace font that supports Devanagari.
I can give you access to it, if you like.

But note: the main problem that we see in ConsoleZ is independent of the font. This is a software bug, either in ConsoleZ or in Windows software.

@StevanWhite
Author

Here is a shorter text page, just the last section of that UDHR file
udhr30.txt

Here is how it looks using the Windows bash console on the same system (same font).
winbash-udhr-hindi
There are no dotted circles here.

But notice the text output routines are automatically wrapping this text, so that no word is broken by the line endings.

What layer is responsible for this?

@StevanWhite
Author

That same text in the same ConsoleZ window, using PowerShell and bash.
consolez-udhr-ps-vs-bash

Note that the results are identical -- they both break the text at the wrong point.
This means that the PowerShell 'get-content' is not to blame for this problem.

@Shreeshrii

We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly.
When she uses ConsoleZ, she gets the same results that I have seen.


Clarification

When I run cmd in consolez, and type the text, it gets some characters in error.

When I run bash from WSL in consolez, and cat the text, it displays ok

So diff could also be between type and cat.

I will check whether error happens in exact same place and check the unicode points of those words.

@Shreeshrii

I tried copying the words getting the error and looking at their codepoints in https://r12a.github.io/apps/conversion/
I find no difference, so probably the copy and paste method is dropping the offending character.

I tried looking at same text file in windows console directly in windows 10 and then also looked at the same via consolez in windows 10. Here are the results with images attached.


windows console - couriernew - indic text as boxes

windows_cmd_couriernew


windows console - with Steve's monospace font - input text rendered in devanagari - but the positioning of combining marks is incorrect

windows_cmd_freemono


windows console run via consolez - same indic text - rendered correctly but certain characters show up as ? / diamonds/ etc.

consolez_cmd


powershell run via consolez - indic text displays correctly

consolez_powershell


WSL bash under windows - run via consolez - indic text displays correctly

consolez_bash

@Shreeshrii

In the above post , where I have said indic text is rendered correctly, I have not included the 'dotted circle' issue - that can probably be avoided by adding line breaks at word boundaries rather than in middle of word.

@StevanWhite
Author

Hi, Shree,
A few explanations of the effects you showed. There are several things going on, and I am just now coming to understand them myself.

  1. The Courier New font has no support for Devanagari. The boxes you see are normal behavior when the glyph is not found.
  2. In Windows terminals, the default encoding may not be UTF-8. When the encoding is set to some other value, wrong characters will appear. Some of your examples show Western letters being displayed instead of Unicode Devanagari-- the chcp 65001 command fixes that, setting the encoding to UTF-8.
  3. I suspect that the old DOS commands such as "type" do bad things to multi-byte (e.g. Unicode) text. This is where you see the diamond-question (Unicode "replacement", U+FFFD) character.
  4. The dotted circle appears when a mark character has failed to apply properly to a preceding character.
    This will happen in normal Hindi text if words are somehow broken at the wrong point, either by some coding oversight (incorrect buffering), or by a poor algorithm for wrapping text.
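
Point 2 in the list above (Western letters instead of Devanagari) is easy to reproduce: the UTF-8 bytes are intact, but they are interpreted under a single-byte code page. The sketch assumes Windows-1252 as the wrong code page; a German or English system may use a different one, with different but equally wrong letters.

```python
word = "\u0915\u093e"            # Devanagari "ka" + vowel sign
raw = word.encode("utf-8")       # b'\xe0\xa4\x95\xe0\xa4\xbe'

# Assumption: the bytes are read as Windows-1252 instead of UTF-8.
mojibake = raw.decode("cp1252")

print(len(word), len(mojibake))  # 2 code points become 6
print(raw.decode("utf-8") == word)
```

Because no bytes are lost, decoding the same bytes as UTF-8 recovers the text, which is why `chcp 65001` is enough to fix this particular symptom.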

@StevanWhite
Author

StevanWhite commented Oct 1, 2017

Christophe, I want to say, you are almost there, as far as display of Indic text is concerned.
There are two things for you to consider for your product.

  1. encoding

The default behavior of Windows consoles regarding encoding is widely considered to be a bug. For instance:
https://stackoverflow.com/questions/22349139/utf-8-output-from-powershell

Our systems are already set for UTF-8 locale, we have set chcp 65001, and yet still in PowerShell, I had to explicitly set the encoding of Get-Content in order to obtain the right output.

Once the encoding is set, as with the default system locale, it should just work. There should be no need to re-state the encoding, as we have done. If you can figure out what is the right thing to do, you will have a terminal that is very superior to anything else on Windows.

  2. Line wrapping of Indic text

For the purpose of reading text, it is very bad to show the normally-hidden characters used to compose Indic words, or to mis-place vowel marks. Both of these are happening when you split a word at the edge of the screen.

There are a couple of options.

i) break lines only at white-space.

This is an easy solution.

For display of normal text, it is maybe the best-looking option. It wreaks havoc in applications that don't expect this behavior. (E.g. text editors such as vim.)

I would suggest an application option for white-space line wrapping. That would settle the problem for most users.
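
Option i) can be sketched with Python's standard `textwrap` module. This is only an illustration of the idea, not a proposal for how ConsoleZ should implement it: `wrap_at_whitespace` is a hypothetical name, and `textwrap` counts code points rather than screen columns, so the widths are approximate for Devanagari.

```python
import textwrap

def wrap_at_whitespace(text, columns):
    """Wrap text at spaces only; a word longer than `columns` is left whole."""
    return textwrap.wrap(
        text,
        width=columns,
        break_long_words=False,   # never cut inside a word
        break_on_hyphens=False,
    )

# Words taken from the report above.
text = "अनुच्छेद किसी का प्रजातन्त्रात्मक"
for line in wrap_at_whitespace(text, 10):
    print(line)
```

A word longer than the window simply overflows its line in this sketch, which is the trade-off that makes this option awkward for full-screen applications.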

ii) in a monospace environment, break after the last rendered character.

I haven't seen this done. If it's possible at all, this would be the preferred solution for some purposes.

There may be special font-rendering API calls that facilitate this. If not, there may be an algorithm for finding the best place to break a word. It would take some thought -- although I can imagine a couple of ways this could be done.

Cheers!

@cbucher
Owner

cbucher commented Oct 3, 2017

Our systems are already set for UTF-8 locale, we have set chcp 65001, and yet still in PowerShell, I had to explicitly set the encoding of Get-Content in order to obtain the right output.

Your file has no BOM, so there is no indication of its encoding.
By default, Get-Content will choose System.Text.Encoding.Default (aka operating system's current ANSI code page).
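
The behaviour described here (use the BOM if there is one, otherwise fall back to a default) looks roughly like the following. This is a sketch of the general technique in Python, not PowerShell's actual implementation; `pick_encoding` and `read_text` are hypothetical names, and the `default` argument stands in for whatever the reader considers the system default.

```python
import codecs

# Checked in this order; UTF-32 BOMs are left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def pick_encoding(head, default):
    """Return a codec name from a leading BOM, or `default` if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default

def read_text(path, default="utf-8"):
    """Read a whole file, honouring a BOM and otherwise using `default`."""
    with open(path, "rb") as f:
        data = f.read()
    return data.decode(pick_encoding(data[:4], default))
```

With an ANSI code page as the default, a BOM-less UTF-8 file is misread; with `default="utf-8"` the same file reads correctly, which is the disagreement in this thread in miniature.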

Once the encoding is set, as with the default system locale, it should just work.

You should contact Microsoft to debate this.
PowerShell is based on .Net and is Unicode only.
chcp configures code page for ANSI applications.
Why does .NET use the operating system's ACP instead of the current console code page as the default encoding? I don't know.

ConsoleZ reads from the Win32 console buffer.
Console applications (shells or not) write into the Win32 console buffer.

The Win32 console buffer is an array (column/line) of characters (with attributes such as color): there is no indication of line breaking.
Writers are responsible for line breaking and encoding.

@StevanWhite
Author

StevanWhite commented Oct 3, 2017

Hi Christophe,

I understand your objections. But let's distinguish between implementation details and how the thing should work.

We are aware of the BOM issue. That is a non-standard Microsoft-only convention. The indication of the encoding ought to be taken from a default, which should be somewhere in the environment. This is failing to happen. Whose fault it is -- is immaterial to the user. The question is, can you do anything to improve the situation? That is still not clear to me.

The encoding functionality is now broken -- it is not just my opinion. It causes a lot of trouble for console users generally, and it impacts your users too. How it should work is, the default encoding should be set when the console app is launched, and cmdlets, or programs, launched within that should inherit the encoding. Maybe it is really impossible to do anything about it at the console level. If so, it would be helpful to document that somewhere, together with recommended work-arounds. (Documentation alone would set you apart from most of the competition!)

The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.

Tell me: what software is responsible for breaking the line at the window edge? The "writer"? The font rendering layer? Or ConsoleZ? That software layer is the only one that can be responsible for breaking the text at a reasonable point.

@cbucher
Owner

cbucher commented Oct 4, 2017

That is a non-standard Microsoft-only convention.

Microsoft did not invent the BOM. Unicode standard does.
I don't understand how a setting in environment could guaranty that all files you will read are encoded in UTF-8.

The question is, can you do anything to improve the situation?

As I explained : nothing. ConsoleZ doesn't read the file. ConsoleZ reads the Unicode characters from WIN32 console buffer. These characters are written by shell commands or console applications.

If you want an overview of how Windows console works, you can consult MSDN WIN32 console functions documentation.

The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.

No this is not a detail. This is how ConsoleZ works and this is the only way to interact with Windows console.

Tell me: what software is responsible for breaking the line at the window edge?

A console has a fixed number of columns. When you have filled a line, the next character is written at the beginning of the next line. This is the same rule for every language.
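
The fixed-column rule can be modelled as a grid that takes one character per cell. The sketch below is a toy model under that assumption, not the Win32 API; `fill_rows` and `rows_starting_with_mark` are hypothetical names. It shows how the rule leaves a combining mark alone at the start of a row, which is where a renderer would draw a dotted circle.

```python
import unicodedata

def fill_rows(text, columns):
    """Place one code point per cell and wrap when a row is full."""
    return [text[i:i + columns] for i in range(0, len(text), columns)]

def rows_starting_with_mark(rows):
    """Indexes of rows whose first cell holds a combining mark."""
    return [
        n for n, row in enumerate(rows)
        if row and unicodedata.category(row[0]).startswith("M")
    ]

# The four code points of the word from the report, in a 3-column console.
rows = fill_rows("\u0915\u093f\u0938\u0940", 3)
print(rows_starting_with_mark(rows))   # [1]
```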

As you can see in my images, we do have a monospace font that supports Devanagari.
I can give you access to it, if you like.

Yes, I would like to have the name or a link to a monospace font that supports Devanagari.

@StevanWhite
Author

Hi again Christophe,

I'm afraid we're misunderstanding one another on several points.

That is a non-standard Microsoft-only convention.

Microsoft did not invent the BOM. Unicode standard does.

I didn't say Microsoft invented the BOM. I meant, only Microsoft products use the initial BOM to indicate encoding. It is not a practice recommended by the standards.
https://en.wikipedia.org/wiki/Byte_order_mark

I don't understand how a setting in environment could guaranty that all files you will read are encoded in UTF-8.

It does not. But we aren't talking about guarantees. We were talking about the "default" encoding.

The effect, for example in unix-y systems, is that plain text with no other indication of encoding is treated as if it were of the default encoding. For example, most unix-y systems these days use UTF-8 as the default encoding. If I open most text files on my system with a text editor, they assume that encoding, and display the text well.

Of course, sometimes I get a file that isn't UTF-8, and I have to figure out the encoding, and tell the editor explicitly what encoding to use. But that is beyond the scope of a console -- rather, what you call a "writer" would have to be told to use a non-default encoding.
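
A minimal Python sketch of the "default encoding" idea described above. The helper name `guess_encoding` is hypothetical, and the logic is only an assumption about how a "writer" might choose: honour a BOM if one is present, otherwise fall back to the default taken from the environment.

```python
import codecs
import locale
from typing import Optional


def guess_encoding(raw: bytes, default: Optional[str] = None) -> str:
    """Pick a codec name for `raw`: use the BOM if there is one, else the default."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    # UTF-32 BOMs are ignored in this sketch to keep it short.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    # No BOM: use the caller's default, or the locale's preferred encoding.
    return default or locale.getpreferredencoding(False)


if __name__ == "__main__":
    sample = "\u0939\u093F\u0928\u094D\u0926\u0940".encode("utf-8")
    print(guess_encoding(sample))
    print(guess_encoding(codecs.BOM_UTF8 + sample))
```

When the guess is wrong, the user still has to name the encoding explicitly, as described above.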

The question is, can you do anything to improve the situation?

As I explained: nothing. ConsoleZ doesn't read the file. ConsoleZ reads the Unicode characters from the Win32 console buffer. These characters are written by shell commands or console applications.

I never said or thought that ConsoleZ reads the file. I'm afraid you've missed my point.

The idea is to somehow instruct the shell commands/applications to use the default encoding, before they write. This has nothing to do with the buffer -- by that time, it's too late. In unix-y systems, the default encoding is set in the environment, and it works right. I understand that Windows is much more complicated in this regard -- and maybe it is simply broken, from what I read. But that doesn't mean there is no way to get around it.

I only ask that you try to understand the problem, and keep your mind open to any solution that you come across.

If you want an overview of how the Windows console works, you can consult the MSDN documentation for the Win32 console functions.

Well maybe this is the root of our misunderstanding. All the Microsoft consoles are broken regarding encoding. That does not mean that all console programs that run in Windows must be broken.

Depending on how much of the Windows APIs your software uses, you might inherit the same problems that the Microsoft programs have. That is a matter of the programmer's choice.

The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.

No, this is not a detail. This is how ConsoleZ works, and it is the only way to interact with the Windows console.

I'm afraid I have failed to communicate what I mean by "implementation detail". I'll try again.

Your user wants to use your product to look at an Indic-language file. Once the issue of encoding is settled, they do see some Indic text, but there's junk in it. They know how it should look, but that's not what they see.

There are two questions:

  1. is it at all possible to display the text so it can be properly read?
  2. do you want to go to the trouble?

When you talk about "Win32 buffer" and "Windows Console", you're talking about the particular APIs and services on the system that you employ to display the text. Of course, as a programmer, you could choose to use those APIs and services, or you could write the whole thing from scratch. These are programming decisions, and that's what I mean by "implementation detail".

You write as though you identify "ConsoleZ" with this particular mechanism for displaying text. Well, it is your product, and you can define it as you like. But this does not mean it is impossible to display the text some other way.

If you insist that ConsoleZ is a program that displays text on the screen using certain system APIs and services, you may indeed be stuck. On the other hand, if you shoot for the goal of displaying text as the user would like it, you may have to abandon some ways of doing things. Some people would love a challenge like that, others would prefer to stick with what they have. It's a choice.

You might also double your number of users. (Which brings up another question: do you want more users?)

Tell me: what software is responsible for breaking the line at the window edge?

A console has a fixed number of columns. When a line is full, the next character is written at the beginning of the next line. The rule is the same for every language.

I suggested that your product could be given a mode, in which it does something smarter than that, for example, to optionally break the incoming text at the last white space, and shove the remaining text on the next line. (There are other possibilities, but as I said, this is the simplest.)

This isn't a matter of possibility -- of course a programmer can find a way to achieve that effect -- it's a matter of your choice. If you want to do it, and if you are a programmer, you can do it.
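
To make the suggestion concrete, here is a small Python sketch of breaking at the last white space before the window edge. The function name `wrap_at_whitespace` is hypothetical, and the sketch assumes one code point per column, which a real terminal would have to replace with proper cell-width and cluster handling.

```python
from typing import List


def wrap_at_whitespace(text: str, columns: int) -> List[str]:
    """Break text into rows of at most `columns` code points, preferring spaces."""
    rows = []
    while len(text) > columns:
        # Last space that still leaves the row within the column limit.
        cut = text.rfind(" ", 0, columns + 1)
        if cut <= 0:
            # No usable space: fall back to a hard break at the edge.
            rows.append(text[:columns])
            text = text[columns:]
        else:
            rows.append(text[:cut])
            text = text[cut + 1:]  # drop the space that became the line break
    rows.append(text)
    return rows


if __name__ == "__main__":
    for row in wrap_at_whitespace("the quick brown fox jumps over the lazy dog", 12):
        print(row)
```

Words longer than the window still get a hard break, so the existing behaviour remains the fallback.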

As you can see in my images, we do have a monospace font that supports Devanagari.
I can give you access to it, if you like.

Yes, I would like to have the name or a link to a monospace font that supports Devanagari.

I will send you a link.

Thanks again!

@cbucher
Owner

cbucher commented Oct 13, 2017

I meant, only Microsoft products use the initial BOM to indicate encoding.

false

For example, most unix-y systems these days use UTF-8 as the default encoding.

That is no reason to presume that all files are UTF-8.
Even on unix-y systems, the console's encoding setting only tells an application what to expect when reading input data and what to write to stdout/stderr. That's all.

Windows is a Unicode OS (strings are encoded in UCS-2, a subset of UTF-16 in which each character takes 2 bytes in memory). When an application reads a text file, the file's content must be converted to UCS-2.
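
To make the 2-byte point concrete, a small Python sketch (the helper name `utf16_units` is hypothetical). Devanagari lies in the Basic Multilingual Plane, so each of its code points fits in one 16-bit unit; characters outside that plane need two units (a surrogate pair) in UTF-16 and have no representation in UCS-2.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store `ch` in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for ch in ("A", "\u0915", "\U0001F609"):  # Latin A, Devanagari KA, winking face
        print(f"U+{ord(ch):04X}: {utf16_units(ch)} UTF-16 unit(s), "
              f"{len(ch.encode('utf-8'))} UTF-8 byte(s)")
```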

Unix-y systems mostly work with plain C char strings. A text file can be read like a binary file. Most applications will process the data without worrying about the file's encoding.

Well maybe this is the root of our misunderstanding. All the Microsoft consoles are broken regarding encoding. That does not mean that all console programs that run in Windows must be broken.

I think you presume that unix-y systems are best, and you criticize Windows without understanding how it works. I am only explaining facts. There are many reasons to criticize Windows and Microsoft. But there are reasons to criticize unix-y systems too 😉.

There are two questions:

  1. is it at all possible to display the text so it can be properly read?
  2. do you want to go to the trouble?

  1. If you use a good font and Windows Unicode applications, the text will be displayed properly.
    But by console design (a fixed grid of rows and columns of characters), line breaks are unavoidable.
    To avoid breaking words, the application that writes the output would have to know what a word is, check whether there is enough space left to write it, and move to the next line if not...
    The dotted circle is drawn automatically by the glyph rendering API (that is, the system text drawing API).
    I am not sure I understand your problem. In a console, words get broken across lines in every non-ideographic language. Nobody complains. If you see a real problem, be factual and explain it clearly.

  2. In fact, I spent my weekly ConsoleZ time quota reading your latest comment.
    Every time I read a long comment, a new feature doesn't get coded...

You might also double your number of users. (Which brings up another question: do you want more users?)

I contribute to this project because I have used it for a long time and needed more features.
In the spirit of the GPL, I provide my modifications freely to all users.
It's free, really free: no ads, no paid download...
I do not earn anything; I just lose time if a feature is useless to me.
I'm more interested in doubling the number of contributors than the number of users.

@StevanWhite
Author

I'm afraid you have misunderstood me on almost every point. I am sorry that I have been unable to explain the basic details to you.

I do not mean to criticise you personally. I only point out that the handling of encodings in Windows console environments is widely considered to be buggy. I provided references -- this is not only my idea.

I too work with Windows professionally, as well as Linux, and I've been using both almost since they first came out. I'm a long-time programmer, too. Thirty years ago it was C, then C++ then Java, and meanwhile a lot of Perl and Python and so on and so on. I do know something about these things.

Most files on unix are "C"-encoded -- well, that isn't really an encoding -- it basically means the file is treated, as you said, as binary. That is surely true if you're talking about system files. But users whose language can be encoded only with Unicode will have their default encoding set to Unicode, and most of their own text files will be Unicode-encoded, so that the textual content can be interpreted as text in their spoken language. This is the situation I've been talking about.

As we have demonstrated to you in images, Hindi text is mangled in one way or another by your program or by any other Windows terminal emulator that we can find. It is not right, sometimes so bad it's unreadable. It is possible to make a terminal emulator that does work well for Hindi and other complex scripts. It would take some thought and programming effort however.

I applaud you for your time and unpaid effort on this free project. And I respect your decision not to spend time on issues that do not interest you. Thank you for your time!

@cbucher
Owner

cbucher commented Oct 16, 2017

(Unicode is not a text encoding; it is a computing industry standard maintained by a consortium.
It covers UTF-8, UCS-2, ...)

I have seen only one real bug in all your comments:
the type command (cmd.exe) produces some garbage in the Windows console buffer.
I will ask Microsoft.

(I guess the text conversion is done block by block. In UTF-8, characters don't have a fixed size, so a character that straddles two blocks could get broken.)
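
This guess can be reproduced in a few lines of Python. The sketch below is only an illustration of the suspected mechanism, not the actual cmd.exe code; `decode_per_block` and `decode_incrementally` are hypothetical names, and the block size is arbitrary. Each Devanagari letter takes 3 bytes in UTF-8, so a block boundary that falls inside a letter yields U+FFFD replacement characters unless the decoder carries the partial bytes over to the next block.

```python
import codecs


def decode_per_block(data: bytes, block_size: int) -> str:
    """Decode each block independently; a split character becomes U+FFFD."""
    return "".join(
        data[i:i + block_size].decode("utf-8", errors="replace")
        for i in range(0, len(data), block_size)
    )


def decode_incrementally(data: bytes, block_size: int) -> str:
    """Decode block by block, keeping incomplete byte sequences between blocks."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [
        decoder.decode(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    parts.append(decoder.decode(b"", final=True))  # flush any trailing bytes
    return "".join(parts)


if __name__ == "__main__":
    text = "\u0939\u093F\u0928\u094D\u0926\u0940"  # the word "Hindi" in Devanagari
    data = text.encode("utf-8")
    print(decode_per_block(data, 4) == text)
    print(decode_incrementally(data, 4) == text)
```

This would also explain why the same word looks fine in one place and broken in another: the result depends on where the word happens to fall relative to a block boundary.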

I don't understand why there is so much mystery about where to find a monospace font that supports Devanagari...

@angtany

angtany commented Jul 6, 2018

Hi StevanWhite,
I am unable to open a Hindi text file in the command prompt on Windows 10.
Even after running the command chcp 65001 (for UTF-8), rectangular boxes (as shown in the attached screenshot) are displayed instead of the Hindi text.
screenshot 36

Please suggest a solution.
Thanks in advance.

@StevanWhite
Author

StevanWhite commented Jul 12, 2018 via email
