roachtest: gossip/chaos/nodes=9 failed #138864
Comments
Puzzling, as always:

```
2025/01/11 09:00:34 gossip.go:153: sleeping for 1s (1s)
```

The first line roughly covers this code:

/pkg/cmd/roachtest/tests/gossip.go#L143-L154

```go
timer := time.AfterFunc(2*time.Second, func() {
	// This is an attempt to debug a rare issue in which either the `Printf`
	// or the `time.Sleep()` surprisingly take >>20s which causes the test
	// to fail.
	//
	// See https://github.com/cockroachdb/cockroach/issues/130737#issuecomment-2352473436.
	_, _ = fmt.Fprintf(os.Stderr, "%s", allstacks.Get())
	t.L().Printf("sleep took too long, dumped stacks to Stderr")
})
t.L().Printf("sleeping for %s (%.0fs)\n", sleepDur, timeutil.Since(start).Seconds())
time.Sleep(sleepDur)
timer.Stop()
```

We do not see the "sleep took too long" message, so we stopped the timer within 2s. Then we loop around and hit

/pkg/cmd/roachtest/tests/gossip.go#L136-L138

```go
for {
	t.L().Printf("checking if gossip excludes dead node %d (%.0fs)\n",
		deadNode, timeutil.Since(start).Seconds())
```

and by the time the [...]

There's an execution trace in the artifacts, but it's a tough one to wrangle, since we don't have a goroutine ID to start at. I've tried with side-eye but not much luck yet. Either way this is azure and this test has been doing things like that for a while, so not release blocking.
Here are all the stacks, taken at the time at which the gossip test detected that something was not right: https://gist.github.com/tbg/93ed1938be95ebfc101ddfab2025b226
In particular, cdc tests were running: they just didn't show up in the exec trace since the main goroutine was blocked throughout:
Here's the full list of running tests:

`grep -B 3 runTest.func2 goros.txt | pbcopy`
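For anyone who prefers to do the same filtering programmatically, here is a rough Go sketch that scans a goroutine dump for stacks mentioning a given frame. `goroutinesContaining` is a hypothetical helper, and `sampleDump` is made-up illustrative data, not taken from the gist. It assumes the usual dump layout, in which goroutines are separated by blank lines.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleDump is fabricated illustrative input, not real test output.
const sampleDump = `goroutine 1 [select, 12 minutes]:
main.main()
	/app/main.go:10 +0x1

goroutine 42 [chan receive]:
example.runTest.func2()
	/app/run.go:99 +0x2
`

// goroutinesContaining splits a goroutine dump into per-goroutine blocks
// (separated by blank lines) and returns the header line of every block
// whose stack mentions needle.
func goroutinesContaining(dump, needle string) []string {
	var headers []string
	for _, block := range strings.Split(strings.TrimSpace(dump), "\n\n") {
		if !strings.Contains(block, needle) {
			continue
		}
		header, _, _ := strings.Cut(block, "\n")
		headers = append(headers, header)
	}
	return headers
}

func main() {
	for _, h := range goroutinesContaining(sampleDump, "runTest.func2") {
		fmt.Println(h)
	}
}
```

Unlike `grep -B 3`, this returns the goroutine header regardless of how deep in the stack the matching frame sits.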
137947: ccl/changeedccl: Add changefeed options into nemesis tests r=wenyihu6 a=aerfrei

This work makes sure our nemesis tests for changefeeds randomize over the options we use upon changefeed creation. This randomly adds the key_in_value option (see below) and full_table_name option half of the time and checks that the changefeed messages respect them in the beforeAfter validator.

Note the following limitations: the full_table_name option, when on, asserts that the topic in the output will be d.public.{table_name} instead of checking for the actual name of the database/schema. This change also does not add the key_in_value option for the webhook and cloudstorage sinks. Even before this change, since key_in_value is on by default for those sinks, we remove the key from the value in those testfeed messages for ease of testing. Unfortunately, this makes these cases hard to test, so we leave them out for now.

See also: #134119

Epic: [CRDB-42866](https://cockroachlabs.atlassian.net/browse/CRDB-42866)

Release note: None

138243: changefeedccl: fix PTS test r=stevendanna a=asg0451

Fix failing TestPTSRecordProtectsTargetsAndSystemTables test

Fixes: #135639
Fixes: #138066
Fixes: #137885
Fixes: #137505
Fixes: #136396
Fixes: #135805

Release note: None

138697: crosscluster: add crdb_route parameter for LDR and PCR r=jeffswenson a=jeffswenson

The `crdb_route` query parameter determines how the destination cluster's stream processor connects to the source cluster. There are two options for the query parameter: "node" and "gateway". Here is an example of using the route parameter to create an external connection that is usable for LDR or PCR.

```SQL
-- A connection that routes all replication traffic via the configured
-- connection URI.
CREATE EXTERNAL CONNECTION 'external://source-db' AS 'postgresql://user:[email protected]:26257/sslmode=verify-full&crdb_route=gateway';

-- A connection that enumerates nodes in the source cluster and connects
-- directly to nodes.
CREATE EXTERNAL CONNECTION 'external://source-db' AS 'postgresql://user:[email protected]:26257/sslmode=verify-full&crdb_route=node';
```

The "node" option is the original and default behavior. The "node" option requires the source and destination clusters to be in the same IP network. The connection string supplied to LDR and PCR is used to connect to the source cluster and generate a physical SQL plan for the replication. The physical plan includes the `--advertise-sql-addr` for nodes in the source cluster, and processors in the destination cluster connect directly to those nodes. Using the "node" routing is ideal because there are no extra network hops and the source cluster can control how load is distributed across its nodes.

The "gateway" option is a new option that is introduced in order to support routing PCR and LDR over a load balancer. When specified, the destination cluster ignores the node addresses returned by the physical plan and instead opens a connection for each processor to the URI supplied by the user. This introduces an extra network hop and does not distribute load as evenly, but it works in deployments where the source cluster is only reachable over a load balancer.

Routing over a load balancer only requires changing the destination cluster's behavior. Nodes in the source cluster were always implemented to act as a gateway and serve rangefeeds that are backed by data stored on different nodes. This support exists so that the cross cluster replication does not need to re-plan every time a range moves to a different node.

Release note (sql change): LDR and PCR may use the `crdb_route=gateway` query option to route the replication streams over a load balancer.

Epic: [CRDB-40896](https://cockroachlabs.atlassian.net/browse/CRDB-40896)

138877: opt: reduce allocations when filtering histogram buckets r=mgartner a=mgartner

`cat.HistogramBuckets` are now returned and passed by value in `getFilteredBucket` and `(*Histogram).addBucket`, respectively, eliminating some heap allocations.

Also, two allocations when building spans from buckets via the `spanBuilder` have been combined into one. The new `(*spanBuilder).init` method simplifies the API by no longer requiring that prefix datums are passed to every invocation of `makeSpanFromBucket`. This also reduces redundant copying of the prefix.

Epic: None

Release note: None

139029: sql/logictest: disable column family mutations in some cases r=mgartner a=mgartner

Random column family mutations are now disabled for `CREATE TABLE` statements with unique, hash-sharded indexes. This prevents the AST from being reserialized with a `UNIQUE` constraint with invalid options, instead of the original `UNIQUE INDEX`. See #65929 and #107398.

Epic: None

Release note: None

139036: testutils,kvserver: add StartExecTrace and adopt in TestPromoteNonVoterInAddVoter r=tbg a=tbg

Every now and then we end up with tests that fail only once in a blue moon and that we can't reproduce at will. #138864 was one of them, and execution traces helped a great deal.

This PR introduces a helper for unit tests that records an execution trace of the test and keeps the trace on failure, and adopts it for one of these pesky unit tests.

The trace contains the goroutine ID in the filename. Additionally, the test's main goroutine is marked via a trace region. Sample below:

<img width="1226" alt="image" src="https://github.com/user-attachments/assets/3f641c28-64f7-4fba-9267-ddd48d8dda03" />

Closes #134383.

Epic: None

Release note: None

Co-authored-by: Aerin Freilich <[email protected]>
Co-authored-by: Miles Frankel <[email protected]>
Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 93fb203a469911c4a3ca7fb79f9a94adcb38689d:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
I'm going to assign this to test-eng until #138904 is worked out, at which point it's worth taking a fresh look at any more recent failures.
cc @cockroachdb/test-eng
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 3cc42e66a71164bd69195ad3c10ab03607a7bc7e:
Parameters:
arch=amd64
cloud=azure
coverageBuild=false
cpu=4
encrypted=false
fs=ext4
localSSD=true
metamorphicLeases=epoch
runtimeAssertionsBuild=true
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
This test on roachdash | Improve this report!
Jira issue: CRDB-46385