feat(c/driver/postgresql): Support for writing DECIMAL types #1288

WillAyd · 2023-11-13T19:57:54Z

No description provided.

WillAyd · 2023-11-13T20:00:15Z

c/driver/postgresql/postgres_copy_reader.h

+    constexpr int kDecDigits = 4;
+
+    // TODO: need some kind of bounds check on this
+    int64_t decimal_int = ArrowDecimalGetIntUnsafe(&decimal);


I don't think this is ultimately correct. I think ideally we would just use the bytes backing the Decimal object, but I haven't yet figured out how that all gets managed when multiple words are required

WillAyd · 2023-11-13T20:01:46Z

Going to require some more time and passes at this, but sharing in case there are any overall thoughts about the architecture.

Once the writer is done, I think it would also be nice to go back and have the COPY reader return decimals directly (right now they get mapped to strings)

lidavidm · 2023-11-13T20:05:37Z

Once the writer is done, I think it would also be nice to go back and have the COPY reader return decimals directly (right now they get mapped to strings)

I think the problem is Postgres decimals are unconstrained in terms of precision/scale and support infinity, NaN, etc. while Arrow decimals are strictly semantics layered on top of 128-bit or 256-bit integers, so the mapping would be a bit wonky

lidavidm · 2023-11-13T20:07:26Z

c/driver/postgresql/postgres_copy_reader.h

+    if (decimal_int < 0) {
+      decimal_int = -decimal_int;
+    }
+    std::vector<int16_t> pg_digits;


probably too small of an optimization to matter, but in principle you should be able to put an upper bound on the number of digits needed to represent an Arrow decimal, and then just stack-allocate?

Yea that's true and actually how postgres does it internally.

https://github.com/postgres/postgres/blob/8680bae8463a0b213893ca6a1c5bb2c2530e823c/src/backend/utils/adt/numeric.c#L8026

If we wanted to stack allocate, I guess we would just expand that out to whatever is required to store up to 4 decimal words?

Yeah, decimal128 would be 38 digits and decimal256 would be 76

OK cool. Shouldn't be too hard to switch to that - just need to figure out how to handle once I get multi-word decimals supported
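For illustration, the bound discussed above can be written down at compile time. This is a minimal sketch, not the driver's code: it assumes the decimal digits are grouped into base-10000 chunks aligned on the decimal point, so a run of n significant digits can straddle at most ceil(n / 4) + 1 chunks. `MaxPgDigits` and `PgDigitBuffer` are hypothetical names introduced here.

```cpp
#include <array>
#include <cstdint>

// Number of decimal digits packed into one base-10000 digit.
constexpr int kDecDigits = 4;

// Hypothetical helper: worst-case number of base-10000 digits needed for a
// value with at most `max_decimal_digits` significant decimal digits. The
// extra +1 covers the case where the decimal point splits a chunk.
constexpr int MaxPgDigits(int max_decimal_digits) {
  return (max_decimal_digits + kDecDigits - 1) / kDecDigits + 1;
}

static_assert(MaxPgDigits(38) == 11, "decimal128 bound");
static_assert(MaxPgDigits(76) == 20, "decimal256 bound");

// Hypothetical fixed-capacity buffer that lives on the stack instead of
// going through std::vector's heap allocation.
template <int kCapacity>
struct PgDigitBuffer {
  std::array<int16_t, kCapacity> digits{};
  int size = 0;

  // Returns false instead of writing out of bounds when the buffer is full.
  bool push_back(int16_t digit) {
    if (size >= kCapacity) return false;
    digits[size++] = digit;
    return true;
  }
};
```

A writer templated on the bit width could then declare `PgDigitBuffer<MaxPgDigits(38)>` or `PgDigitBuffer<MaxPgDigits(76)>` as a local variable.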

WillAyd · 2023-11-13T20:20:12Z

I think the problem is Postgres decimals are unconstrained in terms of precision/scale and support infinity, NaN, etc. while Arrow

Yea that makes sense. So I guess it is standard in the Arrow ecosystem to use a string here?

lidavidm · 2023-11-13T20:41:37Z

I'm not sure about standard, but I don't think we have many choices here :/

WillAyd · 2023-11-13T20:51:01Z

Would it make sense to have a driver option to toggle that? I can see some users not caring much about inf/nan, so pretty unfortunate to not be able to read back in Decimal objects for subsequent computations

lidavidm · 2023-11-13T20:57:36Z

Yeah, I think we're going to end up with some layer of type mapping shenanigans for every driver (SQLite and Snowflake also have to deal with this). We could attempt to convert to an Arrow decimal with given precision/scale based on the column type, and fail (or maybe insert NULL) for invalid values.
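To make the "fail (or maybe insert NULL)" option concrete, here is a rough sketch of how a reader might classify the sign field of a binary numeric value before attempting a conversion. The sign constants are the ones defined in PostgreSQL's numeric.c as I understand them; `InvalidValuePolicy`, `ConvertAction`, and `ClassifyNumeric` are hypothetical names, and a real converter would also have to check that the digits fit the target precision and scale.

```cpp
#include <cstdint>

// Sign-field values of PostgreSQL's numeric type (per numeric.c).
constexpr uint16_t kNumericPos = 0x0000;
constexpr uint16_t kNumericNeg = 0x4000;
constexpr uint16_t kNumericNaN = 0xC000;
constexpr uint16_t kNumericPInf = 0xD000;
constexpr uint16_t kNumericNInf = 0xF000;

// Hypothetical driver option: what to do with values Arrow cannot hold.
enum class InvalidValuePolicy { kFail, kInsertNull };

// Hypothetical result of inspecting one value.
enum class ConvertAction { kConvert, kError, kNull };

inline ConvertAction ClassifyNumeric(uint16_t sign, InvalidValuePolicy policy) {
  switch (sign) {
    case kNumericPos:
    case kNumericNeg:
      // Finite value: a candidate for conversion to an Arrow decimal.
      return ConvertAction::kConvert;
    default:
      // NaN, +/-infinity, or an unknown sign: not representable.
      return policy == InvalidValuePolicy::kInsertNull ? ConvertAction::kNull
                                                       : ConvertAction::kError;
  }
}
```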

lidavidm · 2023-11-13T20:58:07Z

Depending on what people need, we may also want to pursue an extension type (Python decimals behave more like Postgres decimals so there may be some demand, not sure)

jorisvandenbossche · 2023-11-14T07:54:10Z

@WillAyd FYI some previous discussion about support for decimals on the read side: #767

WillAyd · 2023-11-14T12:27:56Z

That is great info, thank you @jorisvandenbossche. I'll post any more thoughts on the reader there going forward to keep the conversation in one thread. Thanks!

WillAyd · 2023-11-14T21:13:58Z

@lidavidm just want to make sure I have the right conceptual model on how decimal numbers get stored across multiple words. Assuming I had a really long sequence like 12345678901234567890 is this the proper way to construct and populate that decimal?

  const uint64_t large_decimal[2] = {1, 2345678901234567890};
  uint8_t large_decimal_bytes[16];
  std::memcpy(&large_decimal_bytes, large_decimal, sizeof(large_decimal_bytes));
  ArrowDecimalSetBytes(&decimal6, large_decimal_bytes);

Or do I need to worry about the endianness of the platform when defining large_decimal?

lidavidm · 2023-11-15T13:54:45Z

Endianness does matter

/// Exact decimal value represented as an integer value in two's
/// complement. Currently only 128-bit (16-byte) and 256-bit (32-byte) integers
/// are used. The representation uses the endianness indicated
/// in the Schema.

Otherwise I think yes but it might be clearer to just write out the bytes
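As an illustration of "just write out the bytes": the words of a decimal128 are the binary halves of one two's-complement integer, not decimal chunks, so `{1, 2345678901234567890}` would not encode 12345678901234567890. That value is below 2^64 and fits entirely in the low word. The sketch below builds the 16 bytes explicitly in little-endian order so the result does not depend on the host; `Int128ToLittleEndianBytes` is a hypothetical helper, not a nanoarrow function, and it assumes the schema declares little-endian data.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: lay out a 128-bit two's-complement value, given as its
// low and high 64-bit halves, as 16 little-endian bytes.
inline std::array<uint8_t, 16> Int128ToLittleEndianBytes(uint64_t low,
                                                         uint64_t high) {
  std::array<uint8_t, 16> bytes{};
  for (int i = 0; i < 8; ++i) {
    bytes[i] = static_cast<uint8_t>(low >> (8 * i));        // bytes 0..7
    bytes[8 + i] = static_cast<uint8_t>(high >> (8 * i));   // bytes 8..15
  }
  return bytes;
}

// 12345678901234567890 < 2^64, so the high half is zero.
inline std::array<uint8_t, 16> ExampleDecimalBytes() {
  return Int128ToLittleEndianBytes(12345678901234567890ULL, 0);
}
```

The resulting array could then be handed to something like `ArrowDecimalSetBytes(&decimal, bytes.data())` on a little-endian target.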

WillAyd · 2023-11-17T20:03:07Z

When it comes to spanning multiple words I see that the Arrow implementation uses a uint128_t to convert the bytes that span multiple words into a decimal-based value. Though I also see uint128_t comes from boost, which I don't think we want to take on as a dependency.

Do you have any high level guidance on how I should be looking at that conversion from multiple-word bytes into a decimal-based value?

lidavidm · 2023-11-20T14:38:59Z

If we actually need arithmetic on 128 bit integers we could either detect __int128 or vendor an implementation
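A sketch of what "detect `__int128` or fall back" could look like for the one operation the digit conversion needs most, dividing a 128-bit unsigned value by a small constant such as 10000. `DivModSmall` is a hypothetical helper; the `__SIZEOF_INT128__` macro is what GCC and Clang define when the extension type is available, and the fallback is plain schoolbook long division over 32-bit limbs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: divides the unsigned 128-bit value (high:low) by
// `divisor` in place and returns the remainder.
inline uint32_t DivModSmall(uint64_t& high, uint64_t& low, uint32_t divisor) {
  assert(divisor != 0);
#if defined(__SIZEOF_INT128__)
  // Compiler provides a native 128-bit integer type.
  unsigned __int128 value = (static_cast<unsigned __int128>(high) << 64) | low;
  const uint32_t remainder = static_cast<uint32_t>(value % divisor);
  value /= divisor;
  high = static_cast<uint64_t>(value >> 64);
  low = static_cast<uint64_t>(value);
  return remainder;
#else
  // Portable fallback: long division, most significant 32-bit limb first.
  uint32_t limbs[4] = {static_cast<uint32_t>(high >> 32),
                       static_cast<uint32_t>(high),
                       static_cast<uint32_t>(low >> 32),
                       static_cast<uint32_t>(low)};
  uint64_t remainder = 0;
  for (uint32_t& limb : limbs) {
    // remainder < divisor < 2^32, so this never overflows 64 bits.
    const uint64_t current = (remainder << 32) | limb;
    limb = static_cast<uint32_t>(current / divisor);
    remainder = current % divisor;
  }
  high = (static_cast<uint64_t>(limbs[0]) << 32) | limbs[1];
  low = (static_cast<uint64_t>(limbs[2]) << 32) | limbs[3];
  return static_cast<uint32_t>(remainder);
#endif
}
```

Calling this repeatedly with a divisor of 10000 until the value reaches zero yields the base-10000 digits of the magnitude, least significant first.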

lidavidm · 2023-12-07T19:19:27Z

For point 1 above I am still trying to figure out how to bolt a parameterized suite of tests onto the existing fixture without making it too complicated, although maybe we have to live with Decimal ingestion tests being different from the rest. Open to ideas

I don't think Googletest is that flexible. Either "parametrize" yourself by looping through a list of cases inside a single actual test case, or create a separate fixture that is parametrized. (Or handwrite a few selected cases.)

lidavidm

LGTM generally, I'll trust you on the algorithm

lidavidm · 2023-12-07T21:11:51Z

c/driver/postgresql/postgres_copy_reader.h

+    bool seen_decimal = scale_ == 0;
+    bool truncating_trailing_zeros = true;
+
+    const std::string decimal_string = DecimalToString<bitwidth_>(&decimal);


another micro-optimization might be to have a stack-allocated char array here and have DecimalToString just fill the char array and return the index of the start; avoids allocating a string in each iteration

(though possibly, it gets all inlined and optimized away anyways)
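The shape of that micro-optimization, sketched for a plain `uint64_t` rather than an `ArrowDecimal` to keep it short: the caller owns a fixed-size char array, the formatter fills it from the back, and the return value is the index of the first digit. `FormatDigits` and `kMaxUint64Digits` are hypothetical names; a decimal128 or decimal256 version would need a larger buffer and multi-word division.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A uint64_t has at most 20 decimal digits.
constexpr size_t kMaxUint64Digits = 20;

// Hypothetical helper: writes the decimal digits of `value` right-aligned
// into `buf` (no terminator) and returns the index of the first digit.
// Nothing is heap-allocated.
inline size_t FormatDigits(uint64_t value,
                           std::array<char, kMaxUint64Digits>& buf) {
  size_t pos = buf.size();
  do {
    buf[--pos] = static_cast<char>('0' + value % 10);
    value /= 10;
  } while (value != 0);
  return pos;
}
```

The digits then occupy `buf[start]` through `buf[buf.size() - 1]`, and the same array can be reused on every loop iteration.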

lidavidm · 2023-12-07T21:13:10Z

c/driver/postgresql/postgres_copy_reader.h

+      const int start_pos = digits_remaining < kDecDigits ?
+        0 : digits_remaining - kDecDigits;
+      const size_t len = digits_remaining < 4 ? digits_remaining : kDecDigits;
+      std::string substr{decimal_string.substr(start_pos, len)};


c++17 would let us use string_view to avoid the extra allocation; we could track indices manually here to avoid it explicitly for now
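A minimal sketch of the "track indices manually" idea: parse each group of up to four characters straight out of the digit string by offset and length, so no temporary `std::string` is created per group. `ParseGroup` and `SplitIntegerDigits` are hypothetical helpers, and the input is assumed to contain only the characters '0' through '9' (no sign, no decimal point).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr size_t kDecDigits = 4;

// Hypothetical helper: parses digits[start, start + len) as a base-10 number
// without creating a substring. len is at most 4, so the result fits int16_t.
inline int16_t ParseGroup(const std::string& digits, size_t start, size_t len) {
  int value = 0;
  for (size_t i = start; i < start + len; ++i) {
    value = value * 10 + (digits[i] - '0');
  }
  return static_cast<int16_t>(value);
}

// Hypothetical helper: splits an integer digit string into base-10000 groups,
// most significant first. Only the first group may be shorter than 4 digits.
inline std::vector<int16_t> SplitIntegerDigits(const std::string& digits) {
  std::vector<int16_t> groups;
  size_t pos = 0;
  size_t len = digits.size() % kDecDigits;
  if (len == 0) len = kDecDigits;
  while (pos < digits.size()) {
    groups.push_back(ParseGroup(digits, pos, len));
    pos += len;
    len = kDecDigits;
  }
  return groups;
}
```

For example, `SplitIntegerDigits("1234567")` yields the groups 123 and 4567.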

lidavidm · 2023-12-07T21:13:53Z

c/driver/postgresql/postgres_copy_reader.h

+    NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, dscale, error));
+
+    for (auto pg_digit : pg_digits) {
+      NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, pg_digit, error));


presumably you could check once then memcpy the digits over

(again, possibly the compiler already does this)
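One way the "check once" idea could look, as a sketch only: grow the output a single time for all digits, then write them without per-element checks. `WriteChecked` is not shown in this thread, so this assumes it does a capacity check plus a conversion to network byte order, which is what the PostgreSQL binary COPY format uses. `AppendInt16sBigEndian` is a hypothetical helper operating on a `std::vector<uint8_t>` rather than the driver's buffer type. A plain `memcpy` of the whole array would only be correct if the digits were already stored big-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `n` int16 values to `out` in big-endian
// (network) byte order, resizing the buffer once up front.
inline void AppendInt16sBigEndian(std::vector<uint8_t>& out,
                                  const int16_t* values, size_t n) {
  const size_t offset = out.size();
  out.resize(offset + n * sizeof(int16_t));  // the single capacity "check"
  uint8_t* dst = out.data() + offset;
  for (size_t i = 0; i < n; ++i) {
    const uint16_t v = static_cast<uint16_t>(values[i]);
    dst[2 * i] = static_cast<uint8_t>(v >> 8);        // high byte first
    dst[2 * i + 1] = static_cast<uint8_t>(v & 0xFF);  // then low byte
  }
}
```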

WillAyd · 2023-12-18T23:14:19Z

I don't think Googletest is that flexible. Either "parametrize" yourself by looping through a list of cases inside a single actual test case, or create a separate fixture that is parametrized. (Or handwrite a few selected cases.)

Thanks for that guidance. Went with the separate class to parametrize which I think is better for coverage. For now it is self-contained to the postgres tests

WillAyd · 2023-12-22T17:22:22Z

OK I think this is reviewable now. Addressed some of the comments and improved test coverage.

The algorithm to convert from Decimal to string is definitely slow. Here is what the benchmark looks like:

2023-12-22T12:20:26-05:00
Running ./release/driver/postgresql/postgresql-benchmark
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.34, 0.70, 0.51
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_PostgresqlExecute/iterations:1         3757964 ns       277292 ns            1
BM_PostgresqlDecimalWrite/iterations:1   28636835 ns     24109617 ns            1

From some light research I think Arrow uses a "powers-of-ten" approach to determine what the digits should be. Happy to take a look at that in a follow-up and see if we can make this faster

WillAyd · 2023-12-24T02:27:33Z

c/driver/postgresql/postgres_copy_reader_test.cc

+  const std::vector<std::optional<ArrowDecimal*>> values = {
+    std::nullopt, &decimal1, &decimal2, &decimal3, &decimal4, &decimal5};
+
+  ArrowSchemaInit(&schema.value);


These 4 lines are essentially what adbc_validation::MakeSchema does. Though that function is templated by type, I wasn't sure if there was a way to template on the type and optionally on precision / scale as well. There certainly could be a more graceful way of handling this in C++ that I am unaware of

I've been meaning to overhaul this for a while now. (Or possibly give in and depend on arrow-cpp...) The current approach really only works for primitive types.

lidavidm · 2024-01-02T20:37:22Z

c/driver/postgresql/postgresql_test.cc

+  // this is a bit of a hack to make std::vector play nicely with
+  // a dynamic number of stack-allocated ArrowDecimal objects
+  constexpr size_t max_decimals = 10;
+  struct ArrowDecimal decimals[max_decimals];
+  if (nrecords > max_decimals) {
+    FAIL() <<
+      " max_decimals exceeded for test case - please change parametrization";
+  }
+
+  std::vector<std::optional<ArrowDecimal*>> values;


What's wrong with vector<optional<ArrowDecimal>>?

Happy to change the MakeBatch implementation to use that type if you prefer; I think I just went with the pointer as we use that same pattern with ArrowInterval * in MakeBatchImpl

Ah ok, no problem then.

Again, I'd really like to overhaul these helpers now that we've gotten more usage out of them 😅

Yea agreed these get a bit tough to use. If you have any high level ideas on what you'd like to see I would be happy to try and help as time permits. Think this would be a great exercise to continually improve my C++

Thanks. I need to think through things more but probably we will end up with something that looks more like Arrow C++. And I'd like to take a look at driver implementations and see if we can factor out a nanoarrow++ of sorts.

WillAyd added 4 commits November 7, 2023 16:45

Initial hacks

a24b046

Merge remote-tracking branch 'upstream/main' into copy-decimal

dbddd8b

feat(c/driver/postgresql): Support for writing DECIMAL128

c5dfd05

removed TODO

a091bcd

WillAyd commented Nov 13, 2023


lidavidm reviewed Nov 13, 2023


trailing decimals

61eb3cc

WillAyd added 4 commits November 25, 2023 20:56

Merge remote-tracking branch 'upstream/main' into copy-decimal

a76ab65

more decimal hacks

078de30

working for positive decimal values

bf8ed7b

Merge branch 'main' into copy-decimal

75cbd58

github-actions bot added this to the ADBC Libraries 0.9.0 milestone Nov 29, 2023

WillAyd added 5 commits November 28, 2023 21:32

negative value support

94bf657

skip other drivers

4b49999

No std::string_view

c046632

cleanups

3957b6d

more generic ToString

c5d19bb

WillAyd changed the title ~~feat(c/driver/postgresql): Support for writing DECIMAL128~~ feat(c/driver/postgresql): Support for writing DECIMAL types Nov 30, 2023

lidavidm reviewed Dec 7, 2023


WillAyd added 13 commits December 14, 2023 15:39

Merge remote-tracking branch 'upstream/main' into copy-decimal

10e6e09

less string

e9967a7

Allocate up front

6a0d3c9

compiling with lifecycle issues

df7ba3e

lifecycle workarounds

ac733bf

Try parametrized postgres-test suite

0cf303f

fix test precision / scale arguments

59cdb22

add nullability testing

759b0f1

decimal256 test cases (but failing)

9472ff5

passing DECIMAL256 tests

0eba157

lint

97c2d5c

endian agnosticism

443efed

fixups

5a93f9e

lidavidm removed this from the ADBC Libraries 0.9.0 milestone Dec 19, 2023

msvc compat?

b629aca

github-actions bot added this to the ADBC Libraries 0.9.0 milestone Dec 22, 2023

WillAyd added 3 commits December 22, 2023 09:22

fix COPY test

f5100d0

Simple benchmark

7e7351d

return int instead of void

dc1b735

WillAyd marked this pull request as ready for review December 22, 2023 17:20

WillAyd commented Dec 24, 2023


lidavidm approved these changes Jan 2, 2024


Merge remote-tracking branch 'upstream/main' into copy-decimal

cc252cd

lidavidm merged commit 2116cff into apache:main Jan 3, 2024
53 of 54 checks passed

WillAyd deleted the copy-decimal branch June 28, 2024 14:56
