
roachtest: gossip/chaos/nodes=9 failed #138864

Open
cockroach-teamcity opened this issue Jan 11, 2025 · 9 comments
Labels
A-testing Testing tools and infrastructure B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-testeng TestEng Team

Comments

cockroach-teamcity commented Jan 11, 2025

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 3cc42e66a71164bd69195ad3c10ab03607a7bc7e:

(gossip.go:93).2: gossip did not stabilize (dead node 4) in 81.4s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-46385

@cockroach-teamcity cockroach-teamcity added B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jan 11, 2025
tbg commented Jan 13, 2025

Puzzling, as always:

2025/01/11 09:00:34 gossip.go:153: sleeping for 1s (1s)
2025/01/11 09:01:53 gossip.go:138: checking if gossip excludes dead node 4 (81s)

The first line roughly covers this code:

/pkg/cmd/roachtest/tests/gossip.go#L143-L154

				timer := time.AfterFunc(2*time.Second, func() {
					// This is an attempt to debug a rare issue in which either the `Printf`
					// or the `time.Sleep()` surprisingly take >>20s which causes the test
					// to fail.
					//
					// See https://github.com/cockroachdb/cockroach/issues/130737#issuecomment-2352473436.
					_, _ = fmt.Fprintf(os.Stderr, "%s", allstacks.Get())
					t.L().Printf("sleep took too long, dumped stacks to Stderr")
				})
				t.L().Printf("sleeping for %s (%.0fs)\n", sleepDur, timeutil.Since(start).Seconds())
				time.Sleep(sleepDur)
				timer.Stop()

We do not see the "sleep took too long" message, so we stopped the timer within 2s. Then we loop around and hit

/pkg/cmd/roachtest/tests/gossip.go#L136-L138

			for {
				t.L().Printf("checking if gossip excludes dead node %d (%.0fs)\n",
					deadNode, timeutil.Since(start).Seconds())

and by the time timeutil.Since(start).Seconds() gets evaluated, 80s have passed! This also rules out the Printf taking that long, since the 81s value was computed before that function was even invoked. t.L() is a simple getter without locks, so it can't have blocked either.
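For reference, the watchdog pattern in question boils down to the following standalone sketch (my own reconstruction, using runtime.Stack in place of the internal allstacks helper):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

// sleepWithWatchdog mirrors the gossip.go pattern: arm a timer that dumps
// all goroutine stacks if the sleep overruns, and disarm it on the way out.
func sleepWithWatchdog(d time.Duration) {
	watchdog := time.AfterFunc(2*d, func() {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true: include all goroutines
		fmt.Fprintf(os.Stderr, "sleep took too long; stacks:\n%s\n", buf[:n])
	})
	// Stop before the deadline normally means the callback never runs, but
	// Stop says nothing about *when* an already-due callback fires if the
	// runtime is stalled (e.g. by a long GC).
	defer watchdog.Stop()
	time.Sleep(d)
}

func main() {
	start := time.Now()
	sleepWithWatchdog(time.Second)
	fmt.Printf("slept for %s\n", time.Since(start))
}
```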

There's an execution trace in the artifacts, but it's a tough one to wrangle, since we don't have a goroutine ID to start at. I've tried with side-eye but not much luck yet.[^1]

Either way this is azure and this test has been doing things like that for a while, so not release blocking.

[^1]: https://cockroachlabs.slack.com/archives/C06KHP4PU4X/p1736761518749049?thread_ts=1736204265.030069&cid=C06KHP4PU4X

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 13, 2025
tbg commented Jan 13, 2025

Aha! I did find something:

test-teardown: 2025/01/11 09:01:54 gossip.go:151: sleep took too long, dumped stacks to Stderr

So this did fire after all. But it fired out of order! We expected to see it between the two log lines above, yet it only showed up afterwards.

I found this goroutine in the execution trace (the only match for a search for "gossip"):

[screenshot: execution trace, the goroutine matching "gossip"]

Unfortunately, I was unable to make my way back to the roachtest's main goroutine, which is what I really wanted to see. But I found something troubling anyway:

[screenshot: execution trace showing a 1m23s GC cycle]

A GC cycle that takes 1m23s is crazy. Combined with the heap size and the choppy but generally high runnable-goroutine counts, I think this Go runtime is just very unhappy.

[screenshots: heap size and runnable goroutine counts]

CPU utilization looks as solidly 100% as I've ever seen:

[screenshots: CPU utilization pinned at 100%]

There just aren't any gaps here.

There's something awry with this roachtest process. I assume we're leaking tons of memory or otherwise building up cruft on these long runs, and anything that gets scheduled towards the end has to make do with a defunct environment.
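If the cruft theory holds, a cheap way to confirm it would be to log runtime stats periodically from the roachtest process. A hypothetical probe (not existing roachtest code; names are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// logRuntimeStats periodically logs heap size, GC counts, and goroutine
// counts so a degrading process environment shows up directly in the logs.
func logRuntimeStats(every time.Duration) {
	go func() {
		var ms runtime.MemStats
		for range time.Tick(every) {
			runtime.ReadMemStats(&ms)
			fmt.Printf("heap=%dMiB numGC=%d totalGCPause=%s goroutines=%d\n",
				ms.HeapAlloc>>20, ms.NumGC,
				time.Duration(ms.PauseTotalNs), runtime.NumGoroutine())
		}
	}()
}

func main() {
	logRuntimeStats(10 * time.Second)
	select {} // stand-in for the long-running test runner
}
```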

tbg commented Jan 13, 2025

I found this test's main goroutine (searching for runTest.func2, which also shows me there are about five other tests running):

[screenshot: the test's main goroutine in the execution trace]

It shows that the unthinkable is true 🤣: we first sleep for one second (first grey block), then do some stuff, then try to sleep another second, but it ends up being an essentially infinite sleep (second grey span extending past the right edge of the screenshot).

The sleep is terminated only when GC terminates:
[screenshot: the sleep ending exactly when the GC cycle ends]

So I'm guessing we're seeing the fallout of timers (time.Sleep/time.AfterFunc) not firing properly while a GC is in progress, and since the GC here takes >80s, that's how long we're delayed if unlucky.
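If that theory is right, the effect is measurable with a trivial stdlib-only probe: under a healthy runtime the lateness below is microseconds, while a runtime stalled in GC for >80s would report lateness on that order. (Toy example, nothing roachtest-specific:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const delay = 100 * time.Millisecond
	for i := 0; i < 10; i++ {
		start := time.Now()
		fired := make(chan struct{})
		time.AfterFunc(delay, func() { close(fired) })
		<-fired
		// How far past its deadline did the callback actually run?
		if late := time.Since(start) - delay; late > 10*time.Millisecond {
			fmt.Printf("iter %d: timer fired %s late\n", i, late)
		}
	}
}
```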

I can only assume this kind of environment causes us many spurious failures that we then triage out as test failures, so I'm quite glad I took the extra time to look into this.

tbg commented Jan 13, 2025

Looking at the processor view, there's a clear pattern:

[screenshot: processor view showing a repeating pattern of 20ms time slices]

I spot-checked some of these 20ms time slices and it led me to this block of five goroutines (i.e. likely created together):

[screenshot: block of five goroutines in the execution trace]

They are all

github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.kafkaManager.startTopicConsumers.func1
        pkg/cmd/roachtest/tests/cdc.go:3371

Interestingly, according to matches on runTest.func2, no cdc test is running: I only see runTLPQuery, kv, gossip/chaos/nodes=9, and some flavor of import/tpcc.

So perhaps the cdc tests don't tear down properly.
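A teardown-time leak check would make this kind of thing attributable to the test that spawned it. A hypothetical sketch of such a harness hook (not the actual roachtest API):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// checkGoroutineLeak compares goroutine counts around a test and dumps the
// survivors if the count grew, so leaked workers (like stray topic
// consumers) are flagged right when the offending test finishes.
func checkGoroutineLeak(name string, run func()) {
	before := runtime.NumGoroutine()
	run()
	time.Sleep(time.Second) // grace period for orderly teardown
	if after := runtime.NumGoroutine(); after > before {
		fmt.Printf("%s: goroutines grew from %d to %d\n", name, before, after)
		_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
	}
}

func main() {
	checkGoroutineLeak("example", func() {
		go func() { select {} }() // deliberately leaked goroutine
	})
}
```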

tbg commented Jan 13, 2025

Here are all the stacks, taken at the time at which the gossip test detected that something was not right:

https://gist.github.com/tbg/93ed1938be95ebfc101ddfab2025b226

tbg commented Jan 13, 2025

In particular, cdc tests were running: they just didn't show up in the exec trace since the main goroutine was blocked throughout:

[09:01:54]W:     [Step 2/2] goroutine 69101 [semacquire, 104 minutes]:
[09:01:54]W:     [Step 2/2] sync.runtime_Semacquire(0x7277b20?)
[09:01:54]W:     [Step 2/2]     GOROOT/src/runtime/sema.go:62 +0x25
[09:01:54]W:     [Step 2/2] sync.(*WaitGroup).Wait(0xa06bad0?)
[09:01:54]W:     [Step 2/2]     GOROOT/src/sync/waitgroup.go:116 +0x48
[09:01:54]W:     [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*cdcTester).waitForWorkload(...)
[09:01:54]W:     [Step 2/2]     pkg/cmd/roachtest/tests/cdc.go:707
[09:01:54]W:     [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerCDC.func4({0xa06bad0, 0xc0021ad310}, {0xa129e30, 0xc009f1c180}, {0xa17d210, 0xc003b45688})
[09:01:54]W:     [Step 2/2]     pkg/cmd/roachtest/tests/cdc.go:1496 +0x309
[09:01:54]W:     [Step 2/2] main.(*testRunner).runTest.func2()
[09:01:54]W:     [Step 2/2]     pkg/cmd/roachtest/test_runner.go:1315 +0x133
[09:01:54]W:     [Step 2/2] created by main.(*testRunner).runTest in goroutine 121
[09:01:54]W:     [Step 2/2]     pkg/cmd/roachtest/test_runner.go:1296 +0xa45

Here's the full list of running tests.

grep -B 3 runTest.func2 goros.txt | pbcopy

[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:334 +0x22e
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerAllocationBenchSpec.func1({0xa06bad0?, 0xc00225ebe0?}, {0xa129e30?, 0xc001d1d500?}, {0xa17d210?, 0xc003a26788?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:263 +0x6f
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/cdc.go:707
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerCDC.func4({0xa06bad0, 0xc0021ad310}, {0xa129e30, 0xc009f1c180}, {0xa17d210, 0xc003b45688})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/cdc.go:1496 +0x309
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:876 +0xe1d
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFailover.func2({0xa06bad0?, 0xc5291f7d60?}, {0xa129e30?, 0xc4d81df980?}, {0xa17d210?, 0xc60a5daf08?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:165 +0x39
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:876 +0xe1d
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFailover.func2({0xa06bad0?, 0xc53421fa40?}, {0xa129e30?, 0xc0a3d5a480?}, {0xa17d210?, 0xc0028e4f08?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:165 +0x39
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/monitor.go:145 +0x39
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFailoverPartialLeaseLiveness({0xa06bad0, 0xc6014bcc30}, {0xa129e30, 0xc2276ccf00}, {0xa17d210, 0xc310bb5908})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:772 +0x107d
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:334 +0x22e
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerAllocationBenchSpec.func1({0xa06bad0?, 0xc0019e5220?}, {0xa129e30?, 0xc0051ecf00?}, {0xa17d210?, 0xc0028e4c88?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:263 +0x6f
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:334 +0x22e
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerAllocationBenchSpec.func1({0xa06bad0?, 0xc0024e5680?}, {0xa129e30?, 0xc0049b6a80?}, {0xa17d210?, 0xc004250788?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/allocation_bench.go:263 +0x6f
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/import.go:123 +0x243
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func2({0xa06bad0?, 0xc8002ff3b0?}, {0xa129e30?, 0xc003553380?}, {0xa17d210?, 0xc77677ef08?})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/import.go:179 +0x50
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/monitor.go:145 +0x39
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFailoverPartialLeaseLeader({0xa06bad0, 0xc6eea282d0}, {0xa129e30, 0xc62804b080}, {0xa17d210, 0xc0892e5b88})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/failover.go:658 +0x115e
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()
--
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/follower_reads.go:1063 +0x632
[09:01:54]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFollowerReadsMixedVersionSingleRegionTest({0xa06bad0, 0xc0147ff860}, {0xa129e30, 0xc39fe8f680}, {0xa17d210, 0xc690cc8c88})
[09:01:54]W:	 [Step 2/2] 	pkg/cmd/roachtest/tests/follower_reads.go:979 +0x1ca
[09:01:54]W:	 [Step 2/2] main.(*testRunner).runTest.func2()

craig bot pushed a commit that referenced this issue Jan 14, 2025
137947: ccl/changefeedccl: Add changefeed options into nemesis tests r=wenyihu6 a=aerfrei

This work makes sure our nemesis tests for changefeeds randomize
over the options we use upon changefeed creation. This randomly adds
the key_in_value option (see below) and full_table_name option half
of the time and checks that the changefeed messages respect them in
the beforeAfter validator.

Note the following limitations: the full_table_name option, when on,
asserts that the topic in the output will be d.public.{table_name}
instead of checking for the actual name of the database/schema.

This change also does not add the key_in_value option for the
webhook and cloudstorage sinks. Even before this change, since
key_in_value is on by default for those sinks, we remove the key
from the value in those testfeed messages for ease of testing.
Unfortunately, this makes these cases hard to test, so we leave them
out for now.

See also: #134119

Epic: [CRDB-42866](https://cockroachlabs.atlassian.net/browse/CRDB-42866)

Release note: None

138243: changefeedccl: fix PTS test r=stevendanna a=asg0451

Fix failing TestPTSRecordProtectsTargetsAndSystemTables test

Fixes: #135639
Fixes: #138066
Fixes: #137885
Fixes: #137505
Fixes: #136396
Fixes: #135805

Release note: None

138697: crosscluster: add crdb_route parameter for LDR and PCR r=jeffswenson a=jeffswenson

The `crdb_route` query parameter determines how the destination
cluster's stream processor connects to the source cluster. There are two
options for the query parameter: "node" and "gateway". Here is an
example of using the route parameter to create an external connection
that is usable for LDR or PCR.

```SQL
-- A connection that routes all replication traffic via the configured
-- connection URI.
CREATE EXTERNAL CONNECTION 'external://source-db' AS
'postgresql://user:[email protected]:26257/sslmode=verify-full&crdb_route=gateway'

-- A connection that enumerates nodes in the source cluster and connects
-- directly to nodes.
CREATE EXTERNAL CONNECTION 'external://source-db' AS
'postgresql://user:[email protected]:26257/sslmode=verify-full&crdb_route=node'
```

The "node" option is the original and default behavior. The "node"
option requires the source and destination clusters to be in the same IP
network. The way it works is the connection string supplied to LDR and
PCR is used to connect to the source cluster and generate a physical sql
plan for the replication. The physical plan includes the
`--sql-advertise-addr` for nodes in the source cluster and processors
in the destination cluster connect directly to the nodes. Using the
"node" routing is ideal because there are no extra network hops and the
source cluster can control how load is distributed across its nodes.

The "gateway" option is a new option that is introduced in order to
support routing PCR and LDR over a load balancer. When specified, the
destination cluster ignores the node addresses returned by the physical
plan and instead opens a connection for each processor to the URI
supplied by the user. This introduces an extra network hop and does not
distribute load as evenly, but it works in deployments where the source
cluster is only reachable over a load balancer.

Routing over a load balancer only requires changing the destination
clusters behavior. Nodes in the source cluster were always implemented to
act as a gateway and serve rangefeeds that are backed by data stored on
different nodes. This support exists so that the cross cluster
replication does not need to re-plan every time a range moves to a
different node.

Release note (sql change): LDR and PCR may use the `crdb_route=gateway`
query option to route the replication streams over a load balancer.

Epic: [CRDB-40896](https://cockroachlabs.atlassian.net/browse/CRDB-40896)

138877: opt: reduce allocations when filtering histogram buckets r=mgartner a=mgartner

`cat.HistogramBuckets` are now returned and passed by value in
`getFilteredBucket` and `(*Histogram).addBucket`, respectively,
eliminating some heap allocations.

Also, two allocations when building spans from buckets via the
`spanBuilder` have been combined into one. The new `(*spanBuilder).init`
method simplifies the API by no longer requiring that prefix datums are
passed to every invocation of `makeSpanFromBucket`. This also reduces
redundant copying of the prefix.

Epic: None

Release note: None
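(As a general illustration of the technique, not the actual optimizer code: returning a small struct by value lets it stay on the caller's stack, while returning a pointer forces it to escape to the heap, visible with `go build -gcflags=-m`.)

```go
package main

import "fmt"

type bucket struct{ numEq, numRange, upperBound float64 }

// byValue returns the filtered bucket by value; it can live on the
// caller's stack, so no heap allocation is needed.
func byValue(b bucket) bucket {
	b.numEq *= 0.5
	return b
}

// byPointer returns a pointer, so b escapes to the heap and each call
// allocates.
func byPointer(b bucket) *bucket {
	b.numEq *= 0.5
	return &b
}

func main() {
	fmt.Println(byValue(bucket{numEq: 10}), byPointer(bucket{numEq: 10}))
}
```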


139029: sql/logictest: disable column family mutations in some cases r=mgartner a=mgartner

Random column family mutations are now disabled for `CREATE TABLE`
statements with unique, hash-sharded indexes. This prevents the AST
from being reserialized with a `UNIQUE` constraint with invalid options,
instead of the original `UNIQUE INDEX`. See #65929 and #107398.

Epic: None

Release note: None


139036: testutils,kvserver: add StartExecTrace and adopt in TestPromoteNonVoterInAddVoter r=tbg a=tbg

Every now and then we end up with tests that fail every once in a blue moon, and we can't reproduce at will.
#138864 was one of them, and execution traces helped a great deal.

This PR introduces a helper for unit tests that execution traces the test and keeps the trace on failure, and adopts it for one of these pesky unit tests.

The trace contains the goroutine ID in the filename. Additionally, the test's main goroutine is marked via a trace region. Sample below:

[screenshot: sample execution trace with the test's main goroutine marked as a trace region]

Closes #134383.

Epic: None
Release note: None
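A helper like this presumably amounts to a thin wrapper around runtime/trace. A rough sketch under assumed names (the real testutils API may differ):

```go
package testutils // assumed location; the real package layout may differ

import (
	"context"
	"os"
	"runtime/trace"
	"testing"
)

// StartExecTrace records an execution trace for the duration of a test and
// marks the test's main goroutine with a trace region so it is easy to find
// in the viewer. Call stop (e.g. via defer) on the same goroutine.
func StartExecTrace(t *testing.T, path string) (stop func()) {
	f, err := os.Create(path)
	if err != nil {
		t.Fatal(err)
	}
	if err := trace.Start(f); err != nil {
		t.Fatal(err)
	}
	ctx, task := trace.NewTask(context.Background(), t.Name())
	region := trace.StartRegion(ctx, "test-main")
	return func() {
		region.End()
		task.End()
		trace.Stop()
		_ = f.Close()
		// The real helper reportedly keeps the trace only on failure and
		// embeds the goroutine ID in the filename; both are omitted here.
	}
}
```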


Co-authored-by: Aerin Freilich <[email protected]>
Co-authored-by: Miles Frankel <[email protected]>
Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@cockroach-teamcity

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 93fb203a469911c4a3ca7fb79f9a94adcb38689d:

(gossip.go:93).2: gossip did not stabilize (dead node 4) in 112.2s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

tbg commented Jan 22, 2025

I'm going to assign this to test-eng until #138904 is worked out, at which point it's worth taking a fresh look at any more recent failures.

@tbg tbg added T-testeng TestEng Team and removed T-kv KV Team labels Jan 22, 2025

blathers-crl bot commented Jan 22, 2025

cc @cockroachdb/test-eng
