Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Create OOB package for Rune, System.Text.Unicode.Utf8, and related APIs #52947

Open
GrabYourPitchforks opened this issue May 18, 2021 · 14 comments · May be fixed by #111292
Open

Proposal: Create OOB package for Rune, System.Text.Unicode.Utf8, and related APIs #52947

GrabYourPitchforks opened this issue May 18, 2021 · 14 comments · May be fixed by #111292
Labels
api-approved API was approved in API review, it can be implemented area-Meta in-pr There is an active PR which will close this issue when it is merged packaging Related to packaging
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented May 18, 2021

Problem summary

While working on the System.Text.Encodings.Web refactoring I noticed that we have several duplicates of the System.Text.Rune type (or its backing logic) throughout our projects.

The official implementation exposed by System.Runtime:

A copy used by the Utf8String OOB:

Multiple copies used by System.Text.Encodings.Web:

Duplicated logic in System.Text.Json:

  • bool result = Utf8Parser.TryParse(source.Slice(idx + 1, 4), out int scalar, out int bytesConsumed, 'x');
    Debug.Assert(result);
    Debug.Assert(bytesConsumed == 4);
    idx += bytesConsumed; // The loop iteration will increment idx past the last hex digit
    if (JsonHelpers.IsInRangeInclusive((uint)scalar, JsonConstants.HighSurrogateStartValue, JsonConstants.LowSurrogateEndValue))
    {
    // The first hex value cannot be a low surrogate.
    if (scalar >= JsonConstants.LowSurrogateStartValue)
    {
    ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16(scalar);
    }
    Debug.Assert(JsonHelpers.IsInRangeInclusive((uint)scalar, JsonConstants.HighSurrogateStartValue, JsonConstants.HighSurrogateEndValue));
    idx += 3; // Skip the last hex digit and the next \u
    // We must have a low surrogate following a high surrogate.
    if (source.Length < idx + 4 || source[idx - 2] != '\\' || source[idx - 1] != 'u')
    {
    ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16();
    }
    // The source is known to be valid JSON, and hence if we see a \u, it is guaranteed to have 4 hex digits following it
    // Otherwise, the Utf8JsonReader would have alreayd thrown an exception.
    result = Utf8Parser.TryParse(source.Slice(idx, 4), out int lowSurrogate, out bytesConsumed, 'x');
    Debug.Assert(result);
    Debug.Assert(bytesConsumed == 4);
    // If the first hex value is a high surrogate, the next one must be a low surrogate.
    if (!JsonHelpers.IsInRangeInclusive((uint)lowSurrogate, JsonConstants.LowSurrogateStartValue, JsonConstants.LowSurrogateEndValue))
    {
    ThrowHelper.ThrowInvalidOperationException_ReadInvalidUTF16(lowSurrogate);
    }
    idx += bytesConsumed - 1; // The loop iteration will increment idx past the last hex digit
    // To find the unicode scalar:
    // (0x400 * (High surrogate - 0xD800)) + Low surrogate - 0xDC00 + 0x10000
    scalar = (JsonConstants.BitShiftBy10 * (scalar - JsonConstants.HighSurrogateStartValue))
    + (lowSurrogate - JsonConstants.LowSurrogateStartValue)
    + JsonConstants.UnicodePlane01StartValue;
    }
    #if BUILDING_INBOX_LIBRARY
    var rune = new Rune(scalar);
    int bytesWritten = rune.EncodeToUtf8(destination.Slice(written));
    #else
    EncodeToUtf8Bytes((uint)scalar, destination.Slice(written), out int bytesWritten);
    #endif
    Debug.Assert(bytesWritten <= 4);
    written += bytesWritten;
    }
    }
    else
    {
    destination[written++] = currentByte;
    }
    }
    }
    #if !BUILDING_INBOX_LIBRARY
    /// <summary>
    /// Copies the UTF-8 code unit representation of this scalar to an output buffer.
    /// The buffer must be large enough to hold the required number of <see cref="byte"/>s.
    /// </summary>
    private static void EncodeToUtf8Bytes(uint scalar, Span<byte> utf8Destination, out int bytesWritten)
    {
    Debug.Assert(JsonHelpers.IsValidUnicodeScalar(scalar));
    Debug.Assert(utf8Destination.Length >= 4);
    if (scalar < 0x80U)
    {
    // Single UTF-8 code unit
    utf8Destination[0] = (byte)scalar;
    bytesWritten = 1;
    }
    else if (scalar < 0x800U)
    {
    // Two UTF-8 code units
    utf8Destination[0] = (byte)(0xC0U | (scalar >> 6));
    utf8Destination[1] = (byte)(0x80U | (scalar & 0x3FU));
    bytesWritten = 2;
    }
    else if (scalar < 0x10000U)
    {
    // Three UTF-8 code units
    utf8Destination[0] = (byte)(0xE0U | (scalar >> 12));
    utf8Destination[1] = (byte)(0x80U | ((scalar >> 6) & 0x3FU));
    utf8Destination[2] = (byte)(0x80U | (scalar & 0x3FU));
    bytesWritten = 3;
    }
    else
    {
    // Four UTF-8 code units
    utf8Destination[0] = (byte)(0xF0U | (scalar >> 18));
    utf8Destination[1] = (byte)(0x80U | ((scalar >> 12) & 0x3FU));
    utf8Destination[2] = (byte)(0x80U | ((scalar >> 6) & 0x3FU));
    utf8Destination[3] = (byte)(0x80U | (scalar & 0x3FU));
    bytesWritten = 4;
    }
    }

System.Text.Json also has some duplicated logic from what turned into the System.Text.Unicode.Utf8 helper APIs:

Proposal

I propose to create a standalone Microsoft.Bcl.Unicode package which includes the functionality of the Rune type plus select helper methods. The exposed API surface would be as follows.

/*
 * These APIs are type-forwarded into System.Runtime or System.Memory on .NET Core 3.1+.
 */

namespace System.Text
{
    //
    // This is the public API surface of Rune as it existed in .NET Core 3.1.
    //
    public readonly partial struct Rune : System.IComparable<System.Text.Rune>, System.IEquatable<System.Text.Rune>
    {
        private readonly int _dummyPrimitive;
        public Rune(char ch) { throw null; }
        public Rune(char highSurrogate, char lowSurrogate) { throw null; }
        public Rune(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public Rune(uint value) { throw null; }
        public bool IsAscii { get { throw null; } }
        public bool IsBmp { get { throw null; } }
        public int Plane { get { throw null; } }
        public static System.Text.Rune ReplacementChar { get { throw null; } }
        public int Utf16SequenceLength { get { throw null; } }
        public int Utf8SequenceLength { get { throw null; } }
        public int Value { get { throw null; } }
        public int CompareTo(System.Text.Rune other) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune result, out int bytesConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune value, out int bytesConsumed) { throw null; }
        public int EncodeToUtf16(System.Span<char> destination) { throw null; }
        public int EncodeToUtf8(System.Span<byte> destination) { throw null; }
        public override bool Equals(object? obj) { throw null; }
        public bool Equals(System.Text.Rune other) { throw null; }
        public override int GetHashCode() { throw null; }
        public static double GetNumericValue(System.Text.Rune value) { throw null; }
        public static System.Text.Rune GetRuneAt(string input, int index) { throw null; }
        public static System.Globalization.UnicodeCategory GetUnicodeCategory(System.Text.Rune value) { throw null; }
        public static bool IsControl(System.Text.Rune value) { throw null; }
        public static bool IsDigit(System.Text.Rune value) { throw null; }
        public static bool IsLetter(System.Text.Rune value) { throw null; }
        public static bool IsLetterOrDigit(System.Text.Rune value) { throw null; }
        public static bool IsLower(System.Text.Rune value) { throw null; }
        public static bool IsNumber(System.Text.Rune value) { throw null; }
        public static bool IsPunctuation(System.Text.Rune value) { throw null; }
        public static bool IsSeparator(System.Text.Rune value) { throw null; }
        public static bool IsSymbol(System.Text.Rune value) { throw null; }
        public static bool IsUpper(System.Text.Rune value) { throw null; }
        public static bool IsValid(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool IsValid(uint value) { throw null; }
        public static bool IsWhiteSpace(System.Text.Rune value) { throw null; }
        public static bool operator ==(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static explicit operator System.Text.Rune (char ch) { throw null; }
        public static explicit operator System.Text.Rune (int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static explicit operator System.Text.Rune (uint value) { throw null; }
        public static bool operator >(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator >=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator !=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static System.Text.Rune ToLower(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToLowerInvariant(System.Text.Rune value) { throw null; }
        public override string ToString() { throw null; }
        public static System.Text.Rune ToUpper(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToUpperInvariant(System.Text.Rune value) { throw null; }
        public static bool TryCreate(char highSurrogate, char lowSurrogate, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(char ch, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(int value, out System.Text.Rune result) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool TryCreate(uint value, out System.Text.Rune result) { throw null; }
        public bool TryEncodeToUtf16(System.Span<char> destination, out int charsWritten) { throw null; }
        public bool TryEncodeToUtf8(System.Span<byte> destination, out int bytesWritten) { throw null; }
        public static bool TryGetRuneAt(string input, int index, out System.Text.Rune value) { throw null; }
    }

    public ref partial struct SpanRuneEnumerator
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        public System.Text.SpanRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
    }

    public partial struct StringRuneEnumerator : System.Collections.Generic.IEnumerable<System.Text.Rune>, System.Collections.Generic.IEnumerator<System.Text.Rune>, System.Collections.IEnumerable, System.Collections.IEnumerator, System.IDisposable
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        object? System.Collections.IEnumerator.Current { get { throw null; } }
        public System.Text.StringRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
        System.Collections.Generic.IEnumerator<System.Text.Rune> System.Collections.Generic.IEnumerable<System.Text.Rune>.GetEnumerator() { throw null; }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { throw null; }
        void System.Collections.IEnumerator.Reset() { }
        void System.IDisposable.Dispose() { }
    }
}

namespace System.Text.Unicode
{
    public static partial class Utf8
    {
        public static System.Buffers.OperationStatus FromUtf16(System.ReadOnlySpan<char> source, System.Span<byte> destination, out int charsRead, out int bytesWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
        public static System.Buffers.OperationStatus ToUtf16(System.ReadOnlySpan<byte> source, System.Span<char> destination, out int bytesRead, out int charsWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
    }
}

/*
 * These APIs are extension methods to mimic instance APIs added in .NET Core 3.1.
 * The extension methods cannot be type-forwarded, but their 3.1+ implementations can act as
 * thin wrappers around the real framework APIs. Downlevel implementations will have the
 * logic fully contained within these methods.
 */

namespace System
{
    public partial static class MemoryExtensions
    {
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.ReadOnlySpan<char> span) { throw null; }
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.Span<char> span) { throw null; }
    }

    public partial static class StringExtensions
    {
        public static System.Text.StringRuneEnumerator EnumerateRunes(this string text) { throw null; }
    }
}

The package should target the frameworks net461, netstandard2.0, and netcoreapp3.1. When running on .NET Core 3.1+, the implementations are largely removed and the package type-forwards into the SDK.

Consumers like System.Text.Json and System.Text.Encodings.Web do not need to include references to this package as part of their ref set. However, when compiling for targets before .NET Core 3.1, the implementation assemblies would have a reference to this pacakge.

Q&A

Can we use an existing package instead of creating a new Microsoft.Bcl.Unicode package? Probably, if that's desired. If we wanted to use an existing package, System.Runtime.Extensions and System.Text.Encoding.Extensions would probably be the obvious candidates. But these assemblies have existing sometimes-OOB-sometimes-inbox behavior, which also affects things like versioning. As part of the SDK, there may be implicit references to the APIs contained within these packages, which could cause problems for MemoryExtensions and StringExtensions (see below). In my ideal world, Microsoft.Bcl.Unicode would be a fully-OOB package that would never be included in the inbox SDK, as it's quite literally a polyfill / point-in-time release.

If we did add these APIs to System.Text.Encoding.Extensions or System.Runtime.Extensions, we would need to add a reference from those packages to System.Memory, which contains OperationStatus and [ReadOnly]Span<T>, since these are exposed via the public API surface.

Are MemoryExtensions and StringExtensions a good idea? The proposal above introduces them into the System namespace so that they look like the shipped .NET Core 3.1+ APIs. The type name MemoryExtensions is clearly already taken, but since most callers should invoke the APIs via extension method syntax rather than typical static method invocation syntax, I don't believe this will cause conflicts in practice. The only time it could cause conflicts is if a caller is targeting .NET Core 3.1+ and also has referenced this package explicitly.

If desired, we can also [Obsolete] the APIs in the .NET Core 3.1+ ref, which signals to library authors that they should only be pulling in this package when compiling for downlevel targets. If they're cross-compiling for multiple targets, they should configure their environment not to include this package reference in later APIs. We can even delete the APIs from the ref but leave them in the implementation, which forbids compiling against them but won't block existing compiled code from binding against it during load.

What about new APIs being introduced? For example, there may be desire for downlevel customers to call APIs introduced in #28230, which is slated for introduction in the 6.0 timeframe. We can take these on a case-by-case basis; but generally speaking, there's nothing stopping us from utilizing the same techniques here. Implement the APIs to the best of our ability downlevel - which may involve leaving off some of the API surface - and type-forward on compatible runtimes.

What about the APIs added under System.Globalization? APIs like CompareInfo.IndexOf(ReadOnlySpan<char>, Rune, ...) cannot be implemented out-of-band without taking a significant performance hit. As it is, the OOB implementation (not the inbox implementation) of System.Text.Rune may already allocate for scenarios like Rune.GetUnicodeCategory(new Rune(0x10000)), but we would try to avoid allocations in the common case. For Rune-enlightened APIs on CompareInfo, we would not be able to avoid this allocation under even the most basic of scenarios. I don't think we should attempt to implement such APIs out-of-band.

@GrabYourPitchforks GrabYourPitchforks added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 18, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label May 18, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@jkotas
Copy link
Member

jkotas commented May 19, 2021

create a standalone System.Text.Unicode package

The naming pattern we have established for netstandard2.0 polyfills like this is Microsoft.Bcl.*, e.g. Microsoft.Bcl.AsyncInterfaces or Microsoft.Bcl.HashCode.

@GrabYourPitchforks
Copy link
Member Author

Thanks for the tip! I've updated the proposal to read Microsoft.Bcl.Unicode.

@joperezr
Copy link
Member

@GrabYourPitchforks is your idea to ship this new package in the 6.0 wave? I ask just to set the milestone.

@terrajobst what do you think? This is not really an API Proposal but just a new package proposal so not sure what the right process is for approving something like this.

@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label May 20, 2021
@GrabYourPitchforks
Copy link
Member Author

@joperezr 6.0 would be ideal, yes, so that our existing packages could remove their copies of this code and take the dependency appropriately.

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label May 20, 2021
@joperezr joperezr added this to the 6.0.0 milestone May 20, 2021
@joperezr joperezr removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label May 20, 2021
@krwq
Copy link
Member

krwq commented Jun 17, 2021

@GrabYourPitchforks have made any progress? Is this still planned for 6.0?

@GrabYourPitchforks
Copy link
Member Author

@krwq I made significant progress on this (see here), but since we're so late in the process I don't think we can hit 6.0. Spoke with Jeff about this as well, and will spend a little time looking through other code bases to see if they can benefit from these types or whether we should change which APIs are available in these packages.

@krwq
Copy link
Member

krwq commented Jul 8, 2021

@GrabYourPitchforks I'll move this to 7.0.0, please let us know if there are any updates here.

@danmoseley danmoseley added this to the Future milestone Jul 6, 2022
@jeffhandley
Copy link
Member

There hasn't been much demand for this posted on this issue, but we're not quite ready to close it out. We'd like to invite input and requests on this issue. If you have scenarios where these APIs would be valuable in downlevel scenarios, please share your needs here. If we don't hear of substantial needs over the next couple months, we will close this out.

/cc @geeknoid @KrzysztofCwalina

@KrzysztofCwalina
Copy link
Member

KrzysztofCwalina commented Jun 7, 2023

Azure SDK does not need the Rune APIs, but we do need the UTF8 buffer-based APIs that Levi worked on. (including but not only the Utf8 static class above) It really would be good to have them in the platform.

@eiriktsarpalis
Copy link
Member

I came across this issue while working on #90352 -- briefly contemplated writing yet another Rune polyfill in STJ.

@Neme12
Copy link

Neme12 commented Sep 14, 2023

This would be helpful for me too as I have my own polyfill for Rune as well.

@terrajobst terrajobst added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Dec 12, 2024
@terrajobst terrajobst modified the milestones: Future, 10.0.0 Dec 12, 2024
@terrajobst
Copy link
Member

Let's discuss this for .NET 10. Looks like we have some internal dependencies on this now, which would be a good driver to make progress here.

@bartonjs
Copy link
Member

bartonjs commented Jan 7, 2025

Video

  • All that seems to be needed at this time is System.Text.Unicode.Utf8, so that's all we should add
  • Instead of a new package, it seems like this can be shoehorned into Microsoft.Bcl.Memory
  • The type should forward on net8/net9/net10
namespace System.Text.Unicode
{
    public static partial class Utf8
    {
        public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
        public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesRead, out int charsWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
        public static bool IsValid(ReadOnlySpan<byte> value);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 7, 2025
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-Meta in-pr There is an active PR which will close this issue when it is merged packaging Related to packaging
Projects
No open projects