Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typedbuffer and low level parsers #45

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

npenin
Copy link

@npenin npenin commented May 9, 2021

This PR is dependent on #35 . It basically allows for typed buffer. Along with some textspan (renamed to bufferspan) improvements, it allows to use byte instead of char, opening the possibility for low level parsers like network/serial protocol parsers.

@npenin npenin changed the title Typedbuffer Typedbuffer and low level parsers May 9, 2021
Copy link
Owner

@sebastienros sebastienros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a way to keep TextSpan : BufferSpan<char> ? And have dedicated methods for string in here, used by the parser that deal with char/string?

@@ -45,7 +45,7 @@ public static bool IsWhiteSpaceOrNewLine(char ch)
public static bool IsNewLine(char ch)
=> (ch == '\n') || (ch == '\r') || (ch == '\v');

public static char ScanHexEscape(string text, int index, out int length)
public static char ScanHexEscape(char[] text, int index, out int length)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not ScanHexEscape(BufferSpan<char> text, int index, out int length) so it doesn't have to allocate a char[]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will have to allocate a char[] as soon as you will have an escape

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but here it has to allocate two of them, the argument, and the returned value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really: ScanHexEscape just returns a char, so no allocation is done. And from the 2 places it is being called from, that's where the unescaping happens

@@ -68,75 +68,69 @@ public static char ScanHexEscape(string text, int index, out int length)
return (char)code;
}

public static TextSpan DecodeString(string s) => DecodeString(new TextSpan(s));
public static BufferSpan<char> DecodeString(string s) => DecodeString(s.ToCharArray());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static BufferSpan<char> DecodeString(string s) => DecodeString(s.ToCharArray());
public static BufferSpan<char> DecodeString(string s) => DecodeString(new BufferSpan<char>(s));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so initially, but eventually, you need to build a char[] as you might be removing some characters if there are any escape to happen

Copy link
Author

@npenin npenin May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may improve it in the case of Span support to avoid this ToCharArray call though

@@ -5,18 +5,20 @@

namespace Parlot.Fluent
{
public sealed class Separated<U, T> : Parser<List<T>>, ICompilable
public sealed class Separated<U, T, TParseContext, TChar> : Parser<List<T>, TParseContext, TChar>, ICompilable<TParseContext, TChar>
where TParseContext : ParseContextWithScanner<Scanner<TChar>, TChar>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseContextWithScanner should only be necessary in low level parsers that have to deal with chars (Literals/Terms).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thouht about it too, but it needs to have a scanner to reset position

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and getting rid of the type makes it hard to apply Then/And/Or/... with "really scanner dependent" parsers

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't there be a base abstract type-less Scanner that can reset the position?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was then worried about the dispatch performance (as I faced it for interfaces earlier)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the scanner, that is hardly doable. You may want to do it on ParseContext but it also needs to get the Cursor position.

src/Parlot/Fluent/DecimalLiteral.cs Outdated Show resolved Hide resolved
@npenin
Copy link
Author

npenin commented May 10, 2021

Isn't there a way to keep TextSpan : BufferSpan ? And have dedicated methods for string in here, used by the parser that deal with char/string?

Not if you intend to keep BufferSpan a struct. The way I did it was using some extension methods for BufferSpan

@sebastienros
Copy link
Owner

I think we should also think about how we could pass a stream or PipeReader (https://docs.microsoft.com/en-us/dotnet/api/system.io.pipelines.pipereader?view=dotnet-plat-ext-5.0) as a source.

@npenin
Copy link
Author

npenin commented May 10, 2021

We should first start using SequenceReader (https://docs.microsoft.com/fr-fr/dotnet/api/system.buffers.sequencereader-1?view=net-5.0) instead of the Cursor

@ToCSharp
Copy link

ToCSharp commented Dec 4, 2021

I think we should also think about how we could pass a stream or PipeReader (https://docs.microsoft.com/en-us/dotnet/api/system.io.pipelines.pipereader?view=dotnet-plat-ext-5.0) as a source.

Please, look at https://github.com/ToCSharp/paspan. It is fork of Parlot, based on Spans.
I tryed to speed up huge logs parsing. Using PipeReader is much slower then reading whole file and then parse. Because it is async and it takes time for reader to say writer to read more, may be for socket streams it will be usefull.
Other way of parsing file with FileStream and PaspanMultiSegment(SequenceReader) is in TODO.

@sebastienros
Copy link
Owner

@ToCSharp interesting, some questions:

  • Aren't pipe readers async by nature? That would make everything async in Parlot.
  • How do you handle back-tracking? A PipeReader will help for back pressure, but when it has to revert to a previous point, you need the bugger in memory, so I don't see how to prevent allocating the whole buffer in memory. An option would be to define a custom buffer length (cyclic). and if it's not sufficient throw an exception. Maybe the solution is only in abstracting the Reader, such that any strategy can be used (FileStream, PipeReader, Buffer).
  • I was thinking that maybe Parlot should come in two flavors, bytes or chars, I see you went "bytes only". Would it be a viable alternative? It would require some allocations though for sources that come as string. Or maybe using a reader that converts chars to bytes on the fly. I find it interesting because it allows to only deal with one type (byte) and we don't need to use a TChar like in this PR, or split Parlot in two (bytes vs chars). A SkipWhitespace for instance works can work with bytes. We could still split some Parsers for chars, bytes, or logic flow (separated, zero or many, ...)

I just realized that you took some code from Utf8JsonReader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants