Replies: 1 comment
I managed to get a working example:

```yaml
input:
  generate:
    mapping: |
      root.key = random_int(min: 1, max: 3)
      root.id = uuid_v4()
      root.ts = now()
    interval: 1ms
    count: 1000
    batch_size: 100
  auto_replay_nacks: false
pipeline:
  processors:
    # split each batch into groups that share the same key
    - group_by_value:
        value: '${! json("key") }'
    # errors for keys that are not cached yet
    - cache:
        resource: multi_cache
        operator: get
        key: 'test/${! json("key") }'
    # only messages whose get failed (unseen keys) run this block
    - catch:
        # clear the error flag so the try block below is not skipped
        - catch: []
        - try:
            # add errors if the key was claimed concurrently...
            - cache:
                resource: mem_cache
                operator: add
                key: 'test/${! json("key") }'
                value: '${! json("id") }'
            # ...in which case the S3 set is skipped
            - cache:
                resource: s3_cache
                operator: set
                key: 'test/${! json("key") }'
                value: '${! json("id") }'
```
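The `mem_cache`, `s3_cache` and `multi_cache` resources referenced above are not shown in the thread; presumably they are a `memory` cache and an `aws_s3` cache combined through a `multilevel` cache. A minimal sketch, with an illustrative TTL, bucket and region:

```yaml
cache_resources:
  - label: mem_cache
    memory:
      default_ttl: 5m # illustrative TTL
  - label: s3_cache
    aws_s3:
      bucket: my-dedupe-bucket # hypothetical bucket
      region: us-east-1
  - label: multi_cache
    # reads check mem_cache first, then fall through to s3_cache
    multilevel:
      - mem_cache
      - s3_cache
```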
I've tried the following input:

```yaml
input:
  generate:
    mapping: |
      root.key = random_int(min: 1, max: 9)
      root.id = uuid_v4()
      root.ts = now()
    interval: 1ms
    count: 100000
    batch_size: 100
  auto_replay_nacks: false
pipeline:
  processors:
    # split each batch into groups that share the same key
    - group_by_value:
        value: '${! json("key") }'
    # errors for keys that are not cached yet
    - cache:
        resource: multi_cache
        operator: get
        key: 'test/${! json("key") }'
    - catch:
        # clear the error flag, then claim the key through the multilevel cache
        - catch: []
        - cache:
            resource: multi_cache
            operator: add
            key: 'test/${! json("key") }'
            value: '${! json("id") }'
```

But it failed my test as well: there was at least one overwrite in the cache. Is there a better way to achieve this? Maybe something more optimized; it feels weird to use `catch`, `catch: []` and `try` in sequence.
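One possibly simpler arrangement, not verified against the same test, is to drop the initial `get` and let the `add` itself act as the membership check, clearing the expected duplicate-key errors afterwards so those messages are not nacked:

```yaml
pipeline:
  processors:
    - group_by_value:
        value: '${! json("key") }'
    - try:
        # errors for keys already claimed, which skips the S3 set below
        - cache:
            resource: mem_cache
            operator: add
            key: 'test/${! json("key") }'
            value: '${! json("id") }'
        - cache:
            resource: s3_cache
            operator: set
            key: 'test/${! json("key") }'
            value: '${! json("id") }'
    # duplicates are expected, so clear the error rather than nacking
    - catch: []
```

This carries the same caveat as the working example above: the memory layer is the real gatekeeper, so it only deduplicates within a single process lifetime.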
---
Hi, I've been interested in the aws_s3 cache for a specific use case and would like to hear some feedback on my approach. My starting point is https://docs.redpanda.com/redpanda-connect/components/caches/aws_s3/. I'm trying to emulate a similar outcome by putting a memory cache between the input and S3, and I came up with this idea:
What I was expecting:
I understand that it would not be completely atomic: if the job fails at the `set` stage, the key would be set in memory but not in S3. I expect that scenario to be extremely rare, and even if it happens it is still OK, as long as no more than one version is ever written to S3.
Use case:
What I am getting:
Questions: Is there a point of failure in my reasoning that I haven't thought of? Am I using the processors correctly? Is it possible to achieve 'write once and do not overwrite' writes to S3 using Redpanda Connect? Is there a better way to achieve it?
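For reference, the most direct expression of 'write once' with the cache processor is the `add` operator, which errors instead of overwriting when the key already exists; whether the aws_s3 backend can enforce that check atomically is exactly what this question is probing. A minimal sketch, reusing the hypothetical `s3_cache` label from earlier:

```yaml
pipeline:
  processors:
    - cache:
        resource: s3_cache # hypothetical aws_s3 cache resource
        operator: add      # errors rather than overwrites if the key exists
        key: 'test/${! json("key") }'
        value: '${! json("id") }'
    # treat duplicate-key errors as expected and drop them
    - catch: []
```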
Thank you in advance!