mssql to parquet as a batch #2274

Someone894 · 2023-11-30T10:05:56Z

Someone894
Nov 30, 2023

Hello,

currently I am trying to read data from a mssql db and store it in an azure_blob_storage as a parquet file.
The azure_blob_storage works so reliable, that in my MWE I use file output instead to keep things simple.

For this test I am trying to load 9 IDs and save them into one singe parquet file.
Sadly I am unable to actuall get this to work and hope that someone here can provide me some help or a hint.
When I save the result in a fixed named file, only the last entry is present.
When I use the timestemp as a file name I see that 9 files each containg one entry are present.

I tried to buffer the data before writing it, but this did not worked.

My final goal is to have a Benthos instance, that I start once a day and it copies all data that accumulated on this day into a yyy_mm_dd.parquet file.

Do you have any ideas what I am doing wrong?
Thank you for your help/time :-)

input:
  label: "database_batched"
  broker:
    inputs:
      - sql_select:
          driver: "mssql"
          dsn: "<connection>"
          table: <table>
          columns: ["*"]
          where: ID > 64 AND ID < 74
    batching:
      count: 0
      byte_size: 0
      period: "5s"
      check: ""

pipeline:
  processors:
    - label: "parquet_encode"
      parquet_encode:
        schema:
          - name: ID
            type: INT64
        default_compression: uncompressed
        default_encoding: PLAIN

output:
  label: "testfile"
  file:
    path: /data/${!timestamp_unix_nano()}.parquet
    #path: /data/data.parquet
    codec: all-bytes

Answered by Someone894

Dec 5, 2023

By now I found my problem period: "5s" is not working but period: 5s is. But @mihaitodor your improved logging is very valuble to me since it feels like Benthos is rather quiet about problems, for example I only found out about the quotes by trying it out at random.

The database I get my data from is updated regularly (a new record every vew secs) but when I set the period to 1h Benthos collects all avilabe data and then just waits for the time to run out without ever checking for new data. Can Benthos be configrued in a way that it collects all data as it occures and starts a new parquet file once a day? Alternatively I just call Benthos once a day and then copy the data from the last da…

View full answer

mihaitodor · 2023-12-05T02:05:01Z

mihaitodor
Dec 5, 2023
Collaborator

Hey @Someone894 👋 I think the parquet_encode processor might be unable to encode your data for some reason. Here's a config to play with:

input:
  generate:
    mapping: |
      root.id = count("foo")
    count: 2
    batch_size: 2

  processors:
    - log:
        message: ${! batch_size() }

    - parquet_encode:
        schema:
          - name: id
            type: INT64
        default_compression: uncompressed
        default_encoding: PLAIN

    - log:
        message: ${! error() }

    - log:
        message: ${! batch_size() }

output:
  file:
    path: ./output.parquet

Note that the logs will print the error as null and batch_size() will evaluate to 2 before the parquet_encode processor and to 1 after it because this processor takes as input a batch of one or more messages and squashes it into a batch of size 1.

In your case, you're setting period: "5s", but I'm guessing that might just produce a single large batch if the input is fast enough to read the entire data in 5 seconds. Try adding some logs like I did above and see if you get any non-nil error. Also check the batch sizes.

It doesn't make sense to use a static name for the output file if you expect to have more than one batch, because then the parquet-encoded data will be appended to it, which I think results in a corrupt parquet file.

0 replies

Someone894 · 2023-12-05T07:34:09Z

Someone894
Dec 5, 2023
Author

By now I found my problem period: "5s" is not working but period: 5s is. But @mihaitodor your improved logging is very valuble to me since it feels like Benthos is rather quiet about problems, for example I only found out about the quotes by trying it out at random.

The database I get my data from is updated regularly (a new record every vew secs) but when I set the period to 1h Benthos collects all avilabe data and then just waits for the time to run out without ever checking for new data. Can Benthos be configrued in a way that it collects all data as it occures and starts a new parquet file once a day? Alternatively I just call Benthos once a day and then copy the data from the last day over.

Of course @mihaitodor you are right about the static file name. Benthos just overwrites the old file, I added it just to keep the MWE short.

Here is the corrected MWE:

input:
  label: "database_batched"
  broker:
    inputs:
      - sql_select:
          driver: "mssql"
          dsn: "<connection>"
          table: <table>
          columns: ["*"]
          where: ID > 64 AND ID < 74
    batching:
      count: 0
      byte_size: 0
      period: 5s
      check: ""

pipeline:
  processors:
    - label: "parquet_encode"
      parquet_encode:
        schema:
          - name: ID
            type: INT64
        default_compression: uncompressed
        default_encoding: PLAIN

output:
  label: "testfile"
  file:
    path: /data/${!timestamp_unix_nano()}.parquet
    #path: /data/data.parquet
    codec: all-bytes

2 replies

mihaitodor Dec 6, 2023
Collaborator

I am fairly certain that period: "5s" is equivalent to period: 5s in yaml. Try playing around with this config:

input:
  generate:
    count: 5
    interval: 1s
    mapping: root = count("test")

output:
  broker:
    outputs:
      - stdout: {}
    batching:
      period: "2s"
      processors:
        - archive:
            format: concatenate

I'm guessing you corrected some other issue in your pipeline at the same time when you made this change.

Someone894 Dec 7, 2023
Author

Today I've tried to rebuild the error I previously had to show you a selfcontained MWE with and without the error, but I wasent able to do so. I think youre right and the quotes should not make a difference, but I rember very clearly that this was the only thing I changed.

On my main "main-quest" I added a switch to process any errors:

output:
  label: "test_out"
  switch:
    retry_until_success: false
    cases:
      - check: errored()
        output:
          label: "bad_files"
          file:
            path: /data/res/bad/${! timestamp_unix_nano() }.json
            codec: all-bytes
      - output:
          label: "ok_files"
          file:
            path: /data/res/ok/zstd.parquet
            codec: all-bytes

which sadly again saves all error cases in indivudual json-files and not in a single one per run. also there is no batchin processor, so i dont know how to batch the data together again.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mssql to parquet as a batch #2274

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

mssql to parquet as a batch #2274

Someone894 Nov 30, 2023

Replies: 2 comments · 2 replies

mihaitodor Dec 5, 2023 Collaborator

Someone894 Dec 5, 2023 Author

mihaitodor Dec 6, 2023 Collaborator

Someone894 Dec 7, 2023 Author

Someone894
Nov 30, 2023

Replies: 2 comments 2 replies

mihaitodor
Dec 5, 2023
Collaborator

Someone894
Dec 5, 2023
Author

mihaitodor Dec 6, 2023
Collaborator

Someone894 Dec 7, 2023
Author