Source Connector Heartbeat #1259

Open: wants to merge 1 commit into master

mjholder (Contributor) commented Oct 25, 2024

Link(s) to Jira

Description of Intent of Change(s)

The what, why and how.

Right now our stage DB does not support replication, which means our outbox is idle.
As a result, the replication slot has no opportunity to progress and our WAL fills up. More info: https://www.morling.dev/blog/insatiable-postgres-replication-slot/

This change will add a heartbeat to the outbox connector and force progress at least once every 5 minutes (300,000 ms).
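
As a rough illustration of the interval arithmetic, the sketch below converts the 5-minute target into milliseconds and prints it as a connector property. It assumes the setting in question is Debezium's `heartbeat.interval.ms` property; the helper name `heartbeat_interval_ms` is hypothetical and is not part of this change.

```shell
#!/bin/sh
# Sketch only: derive the heartbeat interval from a minute count.
# Assumption: the connector is configured through Debezium's
# heartbeat.interval.ms property. The helper name is hypothetical.
heartbeat_interval_ms() {
    minutes=$1
    # minutes -> seconds -> milliseconds
    echo $((minutes * 60 * 1000))
}

# Prints: heartbeat.interval.ms=300000
echo "heartbeat.interval.ms=$(heartbeat_interval_ms 5)"
```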

Without this fix, the source connector in stage shows the status NotReady.
In the connector's yaml, under connectorStatus -> tasks -> trace, you'll see:
Unable to obtain valid replication slot. Make sure there are no long-running transactions ...
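
The same trace should also be readable from the Kafka Connect REST API inside the connect pod. The sketch below assumes the standard `GET /connectors/<name>/status` endpoint on the same host and port as the restart command in step 5, and that `jq` is available in the pod; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: read task state and trace from the Kafka Connect REST API.
# Assumptions: Connect listens on localhost:8083 and the connector is named
# rbac-debezium (both taken from the restart command in this description).
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"
CONNECTOR="${CONNECTOR:-rbac-debezium}"

# Build the status endpoint URL for the connector.
status_url() {
    echo "$CONNECT_URL/connectors/$CONNECTOR/status"
}

# Print each task's id and state, followed by its trace when one is present.
show_task_traces() {
    curl -s "$(status_url)" |
        jq -r '.tasks[] | "task \(.id): \(.state)\n\(.trace // "")"'
}

# Usage (inside the connect pod):
#   show_task_traces
```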

To fix this I use GABI:

  1. Log in to stage and copy your token from the login command; you will need it to run queries on the DB.
  2. Check the replication slot status. Our replication slot is called debezium, and its wal_status should be "lost".
    curl -H "Authorization: Bearer <TOKEN>" https://gabi-rbac-stage.apps.crcs02ue1.urby.p1.openshiftapps.com/query -d "{\"query\": \"Select * from pg_replication_slots;\"}" | jq
  3. Delete the lost slot
    curl -H "Authorization: Bearer <TOKEN>" https://gabi-rbac-stage.apps.crcs02ue1.urby.p1.openshiftapps.com/query -d "{\"query\": \"select pg_drop_replication_slot('debezium');\"}" | jq
  4. Recreate the replication slot
    curl -H "Authorization: Bearer <TOKEN>" https://gabi-rbac-stage.apps.crcs02ue1.urby.p1.openshiftapps.com/query -d "{\"query\": \"select pg_create_logical_replication_slot('debezium', 'pgoutput');\"}" | jq
  5. Restart the connector task to get the connector working again. Run the command below in the connect pod in stage, which is platform-kafka-connect-connect in the platfrom-mq-stage namespace. Note: it may take a couple of minutes for the connector to show as ready.
    curl -X POST localhost:8083/connectors/rbac-debezium/tasks/0/restart
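
Steps 2 through 4 repeat the same curl call with a different query, so they can be wrapped in a small helper. This is a minimal sketch, assuming `TOKEN` is exported in the shell and `jq` is installed; the function names `gabi_payload` and `gabi_query` are hypothetical and not part of GABI.

```shell
#!/bin/sh
# Sketch only: wrap the GABI calls from steps 2-4.
# Assumptions: TOKEN holds the token copied in step 1, and jq is installed.
GABI_URL="${GABI_URL:-https://gabi-rbac-stage.apps.crcs02ue1.urby.p1.openshiftapps.com/query}"

# Build the JSON request body for a SQL statement,
# escaping backslashes and double quotes in the statement.
gabi_payload() {
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"query": "%s"}' "$escaped"
}

# Send one SQL statement to GABI and pretty-print the response.
gabi_query() {
    curl -s -H "Authorization: Bearer $TOKEN" "$GABI_URL" \
        -d "$(gabi_payload "$1")" | jq .
}

# Usage, one step at a time. Only drop the slot after confirming it is lost.
#   gabi_query "Select * from pg_replication_slots;"
#   gabi_query "select pg_drop_replication_slot('debezium');"
#   gabi_query "select pg_create_logical_replication_slot('debezium', 'pgoutput');"
```

Keeping the steps as separate calls, rather than one combined script, leaves room to inspect the slot status before dropping anything.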

All that being said, this is an edge case that occurs only on idle tables. Once we have replication online, this won't be a problem anymore.

Local Testing

How can the feature be exercised?
How can the bug be exploited and fix confirmed?
Is any special local setup required?

Checklist

  • if API spec changes are required, is the spec updated?
  • are there any pre/post merge actions required? if so, document here.
  • are these changes covered by unit tests?
  • if warranted, are documentation changes accounted for?
  • does this require migration changes?
    • if yes, are they backwards compatible?
  • is there known, direct impact to dependent teams/components?
    • if yes, how will this be handled?

Secure Coding Practices Checklist Link

Secure Coding Practices Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices
