Experimental implementation of the On Demand API. #13

ateska · 2021-04-06T17:13:17Z

SIMDJSON recently introduced the On Demand API as its default API.
This is an experiment that uses this API from Python/Cython.

The following issues have been identified so far:

- Once a value is read, subsequent (naive) access fails. Do we need rewind? See: We need a "rewind" function in On Demand simdjson/simdjson#1528
- JSON Pointer is missing. See: Implement JSON Pointer on top of On Demand simdjson/simdjson#1427

Speed is indeed impressive:

----------------------------------------------------------------
# 'perftest/jsonexamples/test.json' 2397 bytes
----------------------------------------------------------------
* cysimdjson (on-demand)   1016803.45 EPS (  1.00)  2437.28 MB/s
* pysimdjson parse          341314.68 EPS (  2.98)   818.13 MB/s
* orjson loads               61601.40 EPS ( 16.51)   147.66 MB/s
* python json loads          40521.16 EPS ( 25.09)    97.13 MB/s
----------------------------------------------------------------
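The MB/s column follows from the other two numbers: events per second times the 2397-byte document size, with MB taken as 10^6 bytes. The figure in parentheses is the slowdown relative to the fastest row. A quick check in Python, using only the numbers from the table above (the variable and function names are illustrative):

```python
DOC_BYTES = 2397  # size of perftest/jsonexamples/test.json, from the table header

# Events per second, copied from the table above.
eps = {
    "cysimdjson (on-demand)": 1016803.45,
    "pysimdjson parse": 341314.68,
    "orjson loads": 61601.40,
    "python json loads": 40521.16,
}

fastest = max(eps.values())


def throughput_mb_s(events_per_second: float) -> float:
    """Bytes parsed per second, expressed in MB (10**6 bytes)."""
    return events_per_second * DOC_BYTES / 1e6


def slowdown(events_per_second: float) -> float:
    """How many times slower a row is than the fastest row."""
    return fastest / events_per_second


for name, value in eps.items():
    print(f"{name:<24} {throughput_mb_s(value):8.2f} MB/s  ({slowdown(value):5.2f}x)")
```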

lemire · 2021-04-06T17:14:08Z

cc @jkeiser

jkeiser · 2021-04-30T16:50:18Z

Holy wow, this is the first time (I think?) I've seen an interpreted language plugin that gets us into multiple GB/s! Nice.

ateska · 2021-05-13T09:59:45Z

@lemire - I have one question that I cannot crack:

Does the On Demand parser assume that the structure of the JSON to be parsed is known in advance?

Thanks.

lemire · 2021-05-13T12:19:07Z

> Does the On Demand parser assume that the structure of the JSON to be parsed is known in advance?

It does not.

However, if you do know the schema, then you can benefit from that knowledge with On Demand.

lemire · 2021-06-02T15:04:13Z

@ateska

> Once the value is read, the subsequent (naive) access fails

Let me clarify. The idea is that you (the user) are supposed to take the value and do something with it... Even with the rewind functionality, it would still not be right to rewind whenever you want to access a value.

So if you have [1,2,3] and you parse it... you get 1, then 2... Ok, you need the 1 again? Well, in the spirit of On Demand, the expectation is that you stored the 1 somewhere.
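The forward-only behaviour described here resembles a Python iterator: each value is handed out once, and anything needed later has to be kept by the caller. A minimal analogy in plain Python (this is not the simdjson API; `values` merely stands in for an On Demand array):

```python
values = iter([1, 2, 3])  # stand-in for an On Demand array over [1,2,3]

first = next(values)   # consumed: the cursor has moved past this element
second = next(values)  # the next element

# The iterator cannot hand out the first element again; the caller kept it.
remaining = list(values)

print(first, second, remaining)
```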

cc @jkeiser

ateska · 2021-06-02T15:57:08Z

@lemire @jkeiser - ok, thanks for the clarification. It more or less matches my "mental model".

The problematic bit in Python is that if you want to store the value in the Python-native way, you have to construct that native type, and that is exactly what is slow ... and pysimdjson/cysimdjson take advantage of delaying that conversion as much as possible; this is essential for the high performance of the binding.

So my current thinking is that some kind of "intermediate" storage at the C++ level will be needed for the On Demand API. And the question is how different this is from the "previous" API.
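To illustrate the delayed conversion described above, here is a rough sketch of a proxy that keeps a value as raw bytes and only builds the Python object on first use. `LazyValue` is a hypothetical name, and the standard `json` module stands in for the C++ side; this is not a description of how either binding is implemented.

```python
import json


class LazyValue:
    """Holds raw JSON bytes; converts to a Python object only when asked."""

    conversions = 0  # counts how many native objects were actually built

    def __init__(self, raw: bytes):
        self._raw = raw
        self._cached = None
        self._done = False

    def get(self):
        if not self._done:
            self._cached = json.loads(self._raw)
            self._done = True
            LazyValue.conversions += 1
        return self._cached


# Three values are held, but only the one that is read gets converted.
raw_fields = {"id": b"42", "name": b'"simdjson"', "tags": b'["a", "b"]'}
fields = {key: LazyValue(raw) for key, raw in raw_fields.items()}

print(fields["name"].get(), LazyValue.conversions)
```

The cost of building native objects is paid per accessed value, not per value in the document, which is the property the bindings rely on.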

lemire · 2021-06-02T16:15:07Z

> The problematic bit in Python is that if you want to store the value in the Python-native way, you have to construct that native type, and that is exactly what is slow ... and pysimdjson/cysimdjson take advantage of delaying that conversion as much as possible; this is essential for the high performance of the binding. So my current thinking is that some kind of "intermediate" storage at the C++ level will be needed for the On Demand API. And the question is how different this is from the "previous" API.

Right. The Python-C++ interface is a nasty challenge. I am aware. :-)

lemire · 2023-04-10T17:17:52Z

Note that the on demand interface has matured considerably since...

TkTech · 2023-04-10T19:13:26Z

Unfortunately the gotcha still exists even with the matured API - it's the same reason pysimdjson has avoided it so far. Given the overwhelming overhead of object construction, the only benefit simdjson wrappers offer to Python over some easier-to-integrate options (like yyjson) is the DOM model for delayed object creation.

lemire · 2023-04-10T19:40:36Z

@TkTech Granted, but I wanted to stress that many of the earlier comments in this issue are obsolete.

jkeiser · 2023-04-11T17:27:15Z

@TkTech assuming Python has optimizations for short-lived objects (which I imagine it does), one design I've been thinking about for On Demand Python is to forgo the simdjson frontend entirely: make a single call to the tokenizer (stage 1), stash those indices in a Python array, and then do an On Demand frontend in Python. That way the opaque C++ boundary doesn't get in the way and Python can do any optimizations it wants (as opposed to when you have to call out to C++ for each value).
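The proposal above can be sketched roughly as follows. `structural_indices` is a pure-Python stand-in for the single stage 1 call (in the actual design that call would go to simdjson, whose interface for this is not specified in this thread), and `scalar_at` is a hypothetical helper showing a Python-side frontend decoding one token on demand. The sketch assumes valid JSON input and does no validation.

```python
import json
from array import array

STRUCTURAL = b"{}[]:,"
WHITESPACE = b" \t\r\n"


def structural_indices(buf: bytes) -> array:
    """Record the offset of every structural character and scalar start."""
    out = array("Q")
    in_string = escaped = in_scalar = False
    for i, b in enumerate(buf):
        if in_string:
            if escaped:
                escaped = False
            elif b == 0x5C:  # backslash escapes the next byte
                escaped = True
            elif b == 0x22:  # closing quote
                in_string = False
            continue
        if b == 0x22:  # opening quote: a string token starts here
            out.append(i)
            in_string = True
            in_scalar = False
        elif b in STRUCTURAL:
            out.append(i)
            in_scalar = False
        elif b in WHITESPACE:
            in_scalar = False
        elif not in_scalar:  # first byte of a number, true, false or null
            out.append(i)
            in_scalar = True
    return out


def scalar_at(buf: bytes, idx: array, k: int):
    """Decode only the k-th token, assuming it is a string, number or literal."""
    start = idx[k]
    end = idx[k + 1] if k + 1 < len(idx) else len(buf)
    return json.loads(buf[start:end].strip())


buf = b'{"a": [1, 25], "b": "x,y"}'
idx = structural_indices(buf)  # one pass over the document
print(list(idx))
print(scalar_at(buf, idx, 6), scalar_at(buf, idx, 11))
```

Only the two tokens passed to `scalar_at` become Python objects; everything else stays as offsets into the original buffer.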

TkTech · 2023-04-11T19:25:14Z

@jkeiser that would be an interesting approach, and it would be nice functionality to have for other things (cases where the end user knows they will need the entire document at some point), but even the cost of creating that initial array is certainly higher than sparse access through the DOM model. Creating the strings in the array is extremely expensive because of how Python stores strings internally, which requires a copy.

TkTech · 2023-04-11T19:27:58Z

Ah, I might have misunderstood. The source buffer comes from Python (which will usually already be in UTF-8 internally, so we can get a zero-cost string pointer for C++) and we're only calling simdjson to get the numeric indices for tokens.
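On the Python side, a `memoryview` is one way to refer to part of an input buffer without copying it, which is the property this approach relies on. A small sketch, assuming the document is already held as `bytes` (the offsets are simply where the string token sits in this example):

```python
buf = b'{"a": [1, 25], "b": "x,y"}'

view = memoryview(buf)
token = view[20:25]  # the string token, quotes included; no bytes are copied

print(token.obj is buf)  # the slice still refers to the original buffer
print(bytes(token))      # an explicit copy is made only here
```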

lemire · 2023-04-11T19:58:52Z

I think that @jkeiser's proposal is worth investigating and it is on my todo. Whether it works out is a research question, but it is practical in the sense that it does not require 'years' of difficult implementation. Although, I must say, there are difficulties.

Related discussion: simdjson/simdjson#1912

Note that it applies to JavaScript runtimes as well... oven-sh/bun#2570

lemire · 2023-04-11T20:02:35Z

Let me quote the results of the recent JavaScript efforts (@Jarred-Sumner)...

> Looks like using SIMDJSON is faster when the input is all primitives, but the cost of creating identifiers and converting strings means it's slower than using native JSON.parse for objects with keys or strings longer than 1 character.

TkTech · 2023-04-11T20:10:06Z

I'm not sure what @ateska's plans are for this repo (I feel like I'm hijacking his issue :)) but on the pysimdjson side ideally we'd get something like...

from_buffer(const char *buffer, uint64_t size_of_buffer, uint64_t **output, int *bytes_read, int *indices_written, malloc_func_t, realloc_func_t)

... which would be very easy to integrate and allow us to do proper memory tracking and re-use.

Jarred-Sumner · 2023-04-11T21:12:30Z

> Let me quote the results of the recent JavaScript efforts (@Jarred-Sumner)...
>
> Looks like using SIMDJSON is faster when the input is all primitives, but the cost of creating identifiers and converting strings means it's slower than using native JSON.parse for objects with keys or strings longer than 1 character.

In JSC's case, strings must be either latin1 or UTF-16. If the programming language internally supports UTF-8 strings it may be cheaper.

Experimental implementation of the On Demand API.

1810554

ateska self-assigned this Apr 6, 2021

ateska linked an issue Apr 6, 2021 that may be closed by this pull request: Switch to SIMDJSON OnDemand API #7 (Open)

ateska added the enhancement New feature or request label May 13, 2021
