Experimental implementation of the On Demand API. #13

Draft
wants to merge 1 commit into
base: main
from

Conversation

ateska
Contributor

@ateska ateska commented Apr 6, 2021

simdjson recently made the On Demand API its default API.
This is an experiment that uses this API from Python/Cython.

The following issues have been identified so far:

Speed is indeed impressive:

----------------------------------------------------------------
# 'perftest/jsonexamples/test.json' 2397 bytes
----------------------------------------------------------------
* cysimdjson (on-demand)   1016803.45 EPS (  1.00)  2437.28 MB/s
* pysimdjson parse          341314.68 EPS (  2.98)   818.13 MB/s
* orjson loads               61601.40 EPS ( 16.51)   147.66 MB/s
* python json loads          40521.16 EPS ( 25.09)    97.13 MB/s
----------------------------------------------------------------
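
A minimal sketch of how figures like these could be measured, assuming EPS means parse calls per second and MB/s is EPS times the payload size divided by 10^6 (which is consistent with the first row: 1016803.45 × 2397 bytes ≈ 2437.28 MB/s). The `benchmark` helper and the sample payload are hypothetical and not taken from the repository's perftest code; the standard-library `json.loads` stands in for the parsers being compared.

```python
import json
import time


def benchmark(parse, payload: bytes, iterations: int = 10_000):
    """Return (EPS, MB/s) for calling parse(payload) `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        parse(payload)
    elapsed = time.perf_counter() - start
    eps = iterations / elapsed            # parse calls per second
    mb_per_s = eps * len(payload) / 1e6   # bytes per second, in MB (10^6 bytes)
    return eps, mb_per_s


# Hypothetical payload; the real benchmark reads perftest/jsonexamples/test.json.
payload = json.dumps({"id": 1, "tags": ["a", "b"], "nested": {"ok": True}}).encode()
eps, mbs = benchmark(json.loads, payload, iterations=1000)
print(f"* python json loads {eps:14.2f} EPS {mbs:10.2f} MB/s")
```

Any other parser can be passed as `parse`, as long as it accepts the payload as `bytes`.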

@ateska ateska self-assigned this Apr 6, 2021
@lemire
Contributor

lemire commented Apr 6, 2021

cc @jkeiser

@ateska ateska linked an issue Apr 6, 2021 that may be closed by this pull request
@jkeiser

jkeiser commented Apr 30, 2021

Holy wow, this is the first time (I think?) I've seen an interpreted language plugin that gets us into multiple GB/s! Nice.

@ateska
Contributor Author

ateska commented May 13, 2021

@lemire - I have one question that I cannot crack:

Does the On Demand parser assume that the structure of the JSON to be parsed is known in advance?

Thanks.

@ateska ateska added the enhancement New feature or request label May 13, 2021
@lemire
Contributor

lemire commented May 13, 2021

> Does the On Demand parser assume that the structure of the JSON to be parsed is known in advance?

It does not.

However, if you do know the schema, then you can benefit from that knowledge with on demand.

@lemire
Contributor

lemire commented Jun 2, 2021

@ateska

> Once the value is read, the subsequent (naive) access fails

Let me clarify. The idea is that you (the user) are supposed to take the value and do something with it... Even with the rewind functionality, it would still not be right to rewind whenever you want to access a value.

So if you have [1,2,3] and you parse it... you get 1, then 2... Ok, you need the 1 again? Well, in the spirit of On Demand, the expectation is that you stored the 1 somewhere.
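
The access pattern described above can be sketched in plain Python. This is only an analogy, not the simdjson API: `ForwardOnlyArray` is a hypothetical class that hands out each element once, in order, and refuses to go back, so the caller has to keep whatever it wants to reuse.

```python
class ForwardOnlyArray:
    """Hypothetical stand-in for a forward-only array cursor."""

    def __init__(self, values):
        self._it = iter(values)
        self._next_index = 0  # index of the next element that can still be read

    def at(self, index):
        if index < self._next_index:
            # The cursor has already moved past this element.
            raise RuntimeError("value already consumed; store it when first read")
        # Skip forward over elements the caller is not interested in.
        while self._next_index < index:
            next(self._it)
            self._next_index += 1
        value = next(self._it)
        self._next_index += 1
        return value


arr = ForwardOnlyArray([1, 2, 3])
first = arr.at(0)    # keep the 1 around if it is needed later
second = arr.at(1)

try:
    arr.at(0)        # the naive second access fails
    reread_failed = False
except RuntimeError:
    reread_failed = True

print(first, second, reread_failed)
```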

cc @jkeiser

@ateska
Contributor Author

ateska commented Jun 2, 2021

@lemire @jkeiser - ok, thanks for the clarification. It more or less matches my "mental model".

The problematic bit in Python is that if you want to store the value in the Python-native way, you have to construct that native type, and that's exactly what is slow ... pysimdjson/cysimdjson take advantage of delaying that conversion as much as possible; this is essential for the high performance of the binding.

So my current thinking is that some kind of "intermediate" storage at the C++ level will be needed for the On Demand API. And the question is how different this is from the "previous" API.
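
To illustrate what "delaying that conversion" means, here is a small sketch in plain Python. It is an assumption-laden model, not how pysimdjson or cysimdjson are implemented: `LazyObject` is hypothetical, the dict of raw byte slices stands in for values held unconverted on the native side, and `json.loads` stands in for the native-to-Python conversion.

```python
import json


class LazyObject:
    """Hypothetical proxy that converts a field only when it is first accessed."""

    def __init__(self, raw_fields):
        self._raw_fields = raw_fields  # key -> unconverted JSON bytes
        self._converted = {}           # key -> Python object, filled on demand

    def __getitem__(self, key):
        if key not in self._converted:
            # The expensive step: building a Python object for this one field.
            self._converted[key] = json.loads(self._raw_fields[key])
        return self._converted[key]

    @property
    def converted_keys(self):
        return sorted(self._converted)


doc = LazyObject({
    "id": b"42",
    "name": b'"cysimdjson"',
    "items": b"[1, 2, 3]",
})

print(doc["id"])           # only "id" becomes a Python object
print(doc.converted_keys)  # "name" and "items" are still raw bytes
```

The cost of object construction is paid per accessed field rather than for the whole document, which is the property the comment above describes as essential.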

@lemire
Contributor

lemire commented Jun 2, 2021

> The problematic bit in Python is that if you want to store the value in the Python-native way, you have to construct that native type, and that's exactly what is slow ... pysimdjson/cysimdjson take advantage of delaying that conversion as much as possible; this is essential for the high performance of the binding. So my current thinking is that some kind of "intermediate" storage at the C++ level will be needed for the On Demand API. And the question is how different this is from the "previous" API.

Right. The python-C++ interface is a nasty challenge. I am aware. :-)

@lemire
Contributor

lemire commented Apr 10, 2023

Note that the on demand interface has matured considerably since...

@TkTech

TkTech commented Apr 10, 2023

Unfortunately the gotcha still exists even with the matured API - it's the same reason pysimdjson has avoided it so far. Given the overwhelming overhead of object construction, the only benefit simdjson wrappers offer to Python over some easier-to-integrate options (like yyjson) is the DOM model for delayed object creation.

@lemire
Contributor

lemire commented Apr 10, 2023

@TkTech Granted, but I wanted to stress that many of the earlier comments in this issue are obsolete.

@jkeiser

jkeiser commented Apr 11, 2023

@TkTech assuming Python has optimizations for short-lived objects (which I imagine it does), one design I've been thinking about for On Demand in Python is to forgo the simdjson frontend entirely: make a single call to the tokenizer (stage 1), stash those indices in a Python array, and then do an On Demand frontend in Python. That way the opaque C++ boundary doesn't get in the way and Python can do any optimizations it wants (as opposed to when you have to call out to C++ for each value).
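
A rough pure-Python sketch of the design proposed above, under loud assumptions: `structural_indices` is a hypothetical (and slow) stand-in for simdjson's stage 1, which in the real design would be a single native call, and `iter_array_ints` is a hypothetical minimal frontend that only handles a flat array of integers. Neither is an existing simdjson, pysimdjson or cysimdjson API, and the exact set of positions the real tokenizer reports may differ.

```python
from array import array

STRUCTURAL = b"{}[]:,"
WHITESPACE = b" \t\r\n"


def structural_indices(buf: bytes) -> array:
    """Stand-in for stage 1: offsets of structural characters and scalar starts."""
    indices = array("I")
    in_string = False
    escaped = False
    at_boundary = True  # previous byte was whitespace or a structural character
    for i, b in enumerate(buf):
        if in_string:
            if escaped:
                escaped = False
            elif b == 0x5C:        # backslash escapes the next byte
                escaped = True
            elif b == 0x22:        # closing quote
                in_string = False
            continue
        if b in WHITESPACE:
            at_boundary = True
        elif b in STRUCTURAL:
            indices.append(i)
            at_boundary = True
        else:
            if b == 0x22:          # opening quote of a string
                indices.append(i)
                in_string = True
            elif at_boundary:      # first byte of a number, true, false or null
                indices.append(i)
            at_boundary = False
    return indices


def iter_array_ints(buf: bytes, indices: array):
    """Minimal frontend: lazily yield the integers of a flat JSON array."""
    if buf[indices[0]] != ord("["):
        raise ValueError("expected an array")
    pos = 1
    while True:
        start = indices[pos]
        if buf[start] == ord("]"):
            return
        end = indices[pos + 1]     # the ',' or ']' that follows the value
        yield int(buf[start:end])  # the only place a Python object is created
        pos += 1
        if buf[indices[pos]] == ord(","):
            pos += 1


buf = b"[1, 2, 3]"
idx = structural_indices(buf)
print(list(idx))
print(list(iter_array_ints(buf, idx)))
```

The indices live in a compact `array` of machine integers, and Python objects are created only for the values the frontend actually visits.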

@TkTech

TkTech commented Apr 11, 2023

@jkeiser that would be an interesting approach, and it would be nice functionality to have for other things (cases where the end user knows they will need the entire document at some point), but even the cost of creating that initial array is certainly higher than sparse access through the DOM model. Creating the strings in the array is extremely expensive because of how Python stores strings internally, which requires a copy.

@TkTech

TkTech commented Apr 11, 2023

Ah, I might have misunderstood. The source buffer comes from Python (which will usually already be in UTF-8 internally, so we can get a zero-cost string pointer for C++), and we're only calling simdjson to get the numeric indices for tokens.
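
As a small illustration of handing a Python-owned buffer around without copying, the sketch below uses the standard `memoryview` type over a `bytes` object. It only shows the Python side; how a binding would pass the pointer to C++ is not shown and would depend on the binding.

```python
data = b'{"key": "value"}'

view = memoryview(data)   # wraps the existing buffer, no copy
window = view[1:6]        # slicing a memoryview does not copy either

print(window.obj is data)  # the slice still refers to the original bytes object
print(bytes(window))       # a copy is made only here, when bytes are materialized
```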

@lemire
Contributor

lemire commented Apr 11, 2023

I think that @jkeiser's proposal is worth investigating and it is on my todo. Whether it works out is a research question, but it is practical in the sense that it does not require 'years' of difficult implementation. Although, I must say, there are difficulties.

Related discussion: simdjson/simdjson#1912

Note that it applies to JavaScript runtimes as well... oven-sh/bun#2570

@lemire
Contributor

lemire commented Apr 11, 2023

Let me quote the results of the recent JavaScript efforts (@Jarred-Sumner)...

> Looks like using SIMDJSON is faster when the input is all primitives, but the cost of creating identifiers and converting strings means it's slower than using native JSON.parse for objects with keys or strings longer than 1 character.

@TkTech

TkTech commented Apr 11, 2023

I'm not sure what @ateska's plans are for this repo (I feel like I'm hijacking his issue :)) but on the pysimdjson side ideally we'd get something like...

from_buffer(const char *buffer, uint64_t size_of_buffer, uint64_t **output, int *bytes_read, int *indices_written, malloc_func_t, realloc_func_t)

... which would be very easy to integrate and allow us to do proper memory tracking and re-use.

@Jarred-Sumner

> Let me quote the results of the recent JavaScript efforts (@Jarred-Sumner)...
>
> > Looks like using SIMDJSON is faster when the input is all primitives, but the cost of creating identifiers and converting strings means it's slower than using native JSON.parse for objects with keys or strings longer than 1 character.

In JSC's case, strings must be either latin1 or UTF-16. If the programming language internally supports UTF-8 strings it may be cheaper.

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch to SIMDJSON OnDemand API
5 participants