Change wording from UTF-16 to Unicode
I believe the wording of the Protocol Guide confuses Unicode and
character encodings such as UTF-16.

Citing ECMA-404, chapter 1 "Scope":

>JSON syntax describes a sequence of Unicode code points.

Citing ECMA-404, chapter 9 "String":

>A string is a sequence of Unicode code points wrapped with quotation marks (U+0022).

By definition, JSON is a sequence of Unicode code points. At the
conceptual level, its fields have no character encoding associated
with them. Only when the JSON is serialized, e.g. for transport over
the wire, is this sequence of Unicode code points encoded using a
specific character encoding.
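
As an illustration of the paragraph above, here is a minimal Python sketch. The sample message content is hypothetical and not taken from the Protocol Guide. It shows the same sequence of code points producing different bytes under two encodings, yet decoding back to an identical string:

```python
import json

# Hypothetical message content, for illustration only.
content = {"type": "post", "text": "héllo 🙂"}

# Conceptually, the JSON text is just a sequence of Unicode code points.
text = json.dumps(content, ensure_ascii=False)

# A character encoding only enters the picture at serialization time.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# The bytes on the wire differ between encodings...
assert as_utf8 != as_utf16
# ...but both decode to the same sequence of code points.
assert as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == text
```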

Talking about a specific UTF encoding of a JSON field and then referring
to string length in code points is confusing. The wording seems to imply
that this specific field is serialised differently from the entire JSON
sequence. This is impossible.
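
For reference, the following Python sketch shows how the measured length of one string depends on what is being counted. The helper names and the sample value are assumptions made for this illustration, not part of the Protocol Guide:

```python
def code_points(s: str) -> int:
    # Python strings are sequences of Unicode code points.
    return len(s)


def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; the "-le" variant adds no BOM.
    return len(s.encode("utf-16-le")) // 2


def utf8_code_units(s: str) -> int:
    # Each UTF-8 code unit is 1 byte.
    return len(s.encode("utf-8"))


# Hypothetical value: U+1F642 lies outside the Basic Multilingual Plane.
sample = "a🙂"

print(code_points(sample), utf16_code_units(sample), utf8_code_units(sample))
# prints "2 3 5"
```

For strings made only of characters in the Basic Multilingual Plane, the code point count and the UTF-16 code unit count agree; they diverge only for characters that need a surrogate pair.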

Moreover, the fact that this JSON is then encoded using UTF-16 is
irrelevant to the remark about the length of this field and is already
covered by this sentence:

>The canonical format is defined by the ECMA-262 6th Edition section
>JSON.stringify. For an example, see how the above message is formatted.

I decided to replace the phrase "UTF-16" with "Unicode" instead of
removing it, to make sure that the phrase "code units" stays explicit.
boreq committed Apr 13, 2022
1 parent c04d813 commit 7ba2a0c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion index.html
@@ -859,7 +859,7 @@ <h3 id="message-format">Message format</h3>
</tr>
<tr>
<td>content</td>
-<td>If the message is not encrypted, This is a dictionary containing free-form data for applications to interpret, plus a mandatory <em>type</em> field. The <em>type</em> field allows applications to filter out message types they don’t understand and must be a UTF-16 string between 3 and 52 code units long (inclusive). If the message is encrypted, then this is a base64 encoded string, followed by a suffix of <code>.box</code>; we will describe private messages later in this document.</td>
+<td>If the message is not encrypted, This is a dictionary containing free-form data for applications to interpret, plus a mandatory <em>type</em> field. The <em>type</em> field allows applications to filter out message types they don’t understand and must be a Unicode string between 3 and 52 code units long (inclusive). If the message is encrypted, then this is a base64 encoded string, followed by a suffix of <code>.box</code>; we will describe private messages later in this document.</td>
</tr>
</table>
<aside style="align-self: start; position: relative; top: 19px;">