
Error refreshing domain cache #4933

Closed · ghost opened this issue Aug 10, 2022 · 4 comments

ghost commented Aug 10, 2022

Version of Cadence server and client (which language)

  • Server version: ubercadence/server:0.19.1
  • Client language: Go

Describe the bug
The Cadence server is unable to refresh its domain cache after the addresses of the Cassandra pods change.

To Reproduce
Is the issue reproducible?

  • Yes

Steps to reproduce the behaviour:

  1. Start the Cadence server along with the Cassandra DB
  2. Rotate the Cassandra pods (simulating any pod issue)
  3. Cadence now throws {"level":"error","msg":"Error refreshing domain cache","service":"cadence-frontend","error":"gocql: no hosts available in the pool","logging-call-at":"domainCache.go:401","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:401"}
  4. After this, Cadence can no longer connect to the Cassandra pods, since their addresses changed during the rotation (see the reconnection sketch below)
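The "no hosts available in the pool" error suggests the gocql connection pool never recovers once the original pod IPs disappear. Purely as an illustration (this is not Cadence's actual persistence setup; the contact point, keyspace, and intervals below are assumptions), gocql exposes reconnection knobs that control how downed hosts are retried:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// "cassandra.default.svc" is a hypothetical stable service DNS name;
	// connecting via a stable name rather than pod IPs gives the driver
	// something to re-resolve after a rotation.
	cluster := gocql.NewCluster("cassandra.default.svc")
	cluster.Keyspace = "cadence"

	// Periodically retry hosts that were marked down, so the pool can
	// recover once replacement pods are serving again.
	cluster.ReconnectInterval = 10 * time.Second
	cluster.ReconnectionPolicy = &gocql.ExponentialReconnectionPolicy{
		MaxRetries:      10,
		InitialInterval: time.Second,
		MaxInterval:     30 * time.Second,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("failed to create session: %v", err)
	}
	defer session.Close()
}
```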

Expected behaviour

  1. When the Cassandra pods are rotated, the domain cache in Cadence should be updated

Logs:
{"level":"error","ts":"2022-08-09T10:06:26.332Z","msg":"Operation failed with internal error.","service":"cadence-frontend","error":"gocql: no hosts available in the pool","metric-scope":42,"logging-call-at":"persistenceMetricClients.go:812","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*metadataPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:812\ngithub.com/uber/cadence/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/cadence/common/persistence/persistenceMetricClients.go:790\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshDomainsLocked\n\t/cadence/common/cache/domainCache.go:425\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshDomains\n\t/cadence/common/cache/domainCache.go:412\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:396"}

{"level":"error","ts":"2022-08-09T10:06:26.332Z","msg":"Error refreshing domain cache","service":"cadence-frontend","error":"gocql: no hosts available in the pool","logging-call-at":"domainCache.go:401","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:401"}


@talha-naeem1

I'm facing this issue:

{"level":"error","ts":"2024-04-19T07:05:05.519Z","msg":"Error refreshing domain cache","service":"cadence-matching","error":"ListDomains timed out. Failed to get domain rows. Error: context deadline exceeded","logging-call-at":"domainCache.go:425","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:131\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:425"}

Did anyone find anything related to this?

@demirkayaender (Contributor) commented

It looks like the query is failing at the storage layer. How do your Cassandra (or whichever storage you are using) metrics look? You might need to scale your storage up or out.

Apart from this, just to check whether your storage is running at all, are you able to run workflows?
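
If it helps, here is a minimal standalone probe (contact point and keyspace are placeholders) to check whether Cassandra answers a trivial query at all; if this times out, the ListDomains errors above are a storage problem rather than a Cadence one:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact point and keyspace; substitute your own.
	cluster := gocql.NewCluster("cassandra.default.svc")
	cluster.Keyspace = "cadence"
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("cannot connect: %v", err)
	}
	defer session.Close()

	// A trivial read against a system table.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var release string
	if err := session.Query("SELECT release_version FROM system.local").
		WithContext(ctx).Scan(&release); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	log.Printf("Cassandra is reachable, release_version=%s", release)
}
```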

ibarrajo added the bug label on Nov 1, 2024
@ibarrajo (Contributor) commented Nov 1, 2024

Closing since the report is from a deleted user.

ibarrajo closed this as completed on Nov 1, 2024
@ibarrajo (Contributor) commented Nov 1, 2024

possible duplicate: #4340
