Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pull in updated Propolis with new VM timeseries, and expunge the old #6169

Merged
merged 3 commits into from
Jul 30, 2024

Conversation

bnaecker
Copy link
Collaborator

@bnaecker
Copy link
Collaborator Author

I tested this by setting up an instance following the steps in docs/how-to-run.adoc. (I also updated those, since the CLI has grown some features since that was written, and we no longer need to use so many raw oxide api calls.) Once the instance was running, I verified the new timeseries look like what I want with omdb oxql:

0x〉get virtual_machine:vcpu_usage | filter instance_id == "5c591b30-ff34-4ce5-8340-9c44a46458ec" && timestamp > @now() - 1m

virtual_machine:vcpu_usage

 instance_id: 5c591b30-ff34-4ce5-8340-9c44a46458ec
 project_id: ad527d61-1496-471c-96ce-eafcd09af0fa
 silo_id: ca938d1d-1503-4c45-a35e-05935f16e60e
 sled_id: 84493faa-6f81-4e9a-b992-25d3ac23f22c
 sled_model: i86pc
 sled_revision: 0
 sled_serial: shale
 state: idle
 vcpu_id: 1
   [2024-07-27 04:19:14.318947266, 2024-07-27 04:24:19.514482578]: [304197630205]
   [2024-07-27 04:24:19.514482578, 2024-07-27 04:24:24.517596926]: [5034136543]
   [2024-07-27 04:24:24.517596926, 2024-07-27 04:24:29.519077580]: [4904946806]
   [2024-07-27 04:24:29.519077580, 2024-07-27 04:24:34.521518373]: [5055034054]
   [2024-07-27 04:24:34.521518373, 2024-07-27 04:24:39.523556778]: [4964988858]
   [2024-07-27 04:24:39.523556778, 2024-07-27 04:24:44.525660266]: [5035193314]
   [2024-07-27 04:24:44.525660266, 2024-07-27 04:24:49.527108833]: [5004918498]
   [2024-07-27 04:24:49.527108833, 2024-07-27 04:24:54.530111888]: [4994701796]
   [2024-07-27 04:24:54.530111888, 2024-07-27 04:24:59.531245904]: [4944856066]
   [2024-07-27 04:24:59.531245904, 2024-07-27 04:25:04.533745160]: [5054700555]
   [2024-07-27 04:25:04.533745160, 2024-07-27 04:25:09.535189573]: [4944809167]
   [2024-07-27 04:25:09.535189573, 2024-07-27 04:25:14.537294821]: [5054567209]
<SNIP>

Note that this commit also contains a ClickHouse "update" -- though that has no changes to the database schema, it does include a step which deletes all the previous virtual machine timeseries that this conflicts with.

- Pulls in oxidecomputer/propolis#724, which
  added sled-identifiers to the virtual machine timeseries. One part of
  #5267.
- Updates requried Crucible dependency.
- Expunge previous schema and data for the virtual machine timeseries,
  as the new schema is incompatible.
@bnaecker
Copy link
Collaborator Author

Yeah, I'm looking at that. Nothing's changed in Omicron in terms of task spawning, so I'm looking at Propolis. Both 4708ab2401aa8bfd66303515ffdf0d9b580fd52e and 0a3a26baf66b97e1666a1f566aa3e93a9a5a0c1d did have something to do with how tasks are spawned, but I'm not yet sure how or whether they're related.

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Eliza that we should look into the test failure, but the contents of this PR LGTM otherwise!

@hawkw
Copy link
Member

hawkw commented Jul 29, 2024

Yeah, I'm looking at that. Nothing's changed in Omicron in terms of task spawning, so I'm looking at Propolis. Both 4708ab2401aa8bfd66303515ffdf0d9b580fd52e and 0a3a26baf66b97e1666a1f566aa3e93a9a5a0c1d did have something to do with how tasks are spawned, but I'm not yet sure how or whether they're related.

It looks like the stack trace is from sled-agent, not Propolis — note the source paths. You're right that there have been recent changes to Propolis that involve block_on and friends, but I'm not sure if I get how those changes could make sled-agent panic?

@bnaecker
Copy link
Collaborator Author

I'm with you, and also confused. I don't see how they're really related, especially since the apparent panic site is in a call entirely within the test test_instance_create_timeout_while_creating_zone. That's passing a closure to one of the mock expectations, and that's what's panicking, because the closure calls block_on.

@bnaecker
Copy link
Collaborator Author

Ok, I'm good and confused. I don't see how this ever worked. There are indeed two calls to Runtime::block_on in the call stack. The first is here:

  67:          0x46b68ec - tokio::runtime::runtime::Runtime::block_on::hdd60c178e3d7c8af
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/runtime.rs:347:47

which is the #[tokio::test] wrapper providing us a "current_thread" runtime for the test. The second is here:

  17:          0x48871dd - tokio::runtime::handle::Handle::block_on::h757f00a562cef852
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/handle.rs:309:9
  18:          0x4ba428f - omicron_sled_agent::instance::tests::test_instance_create_timeout_while_creating_zone::{{closure}}::{{closure}}::h4c97373c9d24d8c1
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:2061:13

Which is from here:

boot_ctx.expect().return_once(move |_| {
rt_handle.block_on(tokio::time::sleep(TIMEOUT_DURATION * 2));
Ok(())
});

That's been there since f8c8bb1, which is from 18 Mar 2024. How has that been succeeding, if there are indeed nested calls to block_on()? This commit also does not modify either Tokio itself or anything related to instance-management, other than the Propolis dependency. As @hawkw pointed out, that doesn't seem to be on the scene at all here, since the test is entirely built on mock objects.

I'll continue to dig.

@bnaecker
Copy link
Collaborator Author

I'm not really sure what I'm looking at yet, but it seems like this might not have been working how we intended it on main either. Running vanilla cargo nextest run test_instance_create_timeout_while_creating_zone shows that the test passes. However, if you run it with backtraces, we see this:

bnaecker@shale : ~/omicron $ RUST_BACKTRACE=full cargo nextest run test_instance_create_timeout_while_creating_zone --nocapture
info: experimental features enabled: setup-scripts
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1.14s
    Starting 1 test across 140 binaries (1373 skipped; run ID: c7210194-698a-4084-939e-1e79be40d977, nextest profile: default)
       START             omicron-sled-agent instance::tests::test_instance_create_timeout_while_creating_zone

running 1 test
log file: /tmp/omicron_sled_agent-116c61e885c32c44-test_instance_create_timeout_while_creating_zone.20671.0.log
note: configured to log to "/tmp/omicron_sled_agent-116c61e885c32c44-test_instance_create_timeout_while_creating_zone.20671.0.log"
Terminating StorageManagerTestHarness
Destroying pool: ZpoolName { id: 1cf28d4b-a935-4ba9-b6f0-b8b63479b044 (zpool), kind: Internal }
Destroying pool: ZpoolName { id: 14078806-65c6-43e1-856c-38f11a7b5223 (zpool), kind: External }
thread 'instance::tests::test_instance_create_timeout_while_creating_zone' panicked at illumos-utils/src/destructor.rs:150:14:
called `Result::unwrap()` on an `Err` value: "channel closed"
stack backtrace:
   0:          0x8ac0a43 - std::backtrace_rs::backtrace::libunwind::trace::h038277e88fb32018
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:          0x8ac0a43 - std::backtrace_rs::backtrace::trace_unsynchronized::hee8bce1f0364ec29
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x8ac0a43 - std::sys_common::backtrace::_print_fmt::h862a0ca96496f39c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:68:5
   3:          0x8ac0a43 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h51a55916cbfff243
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:44:22
   4:          0x8b0d4eb - core::fmt::rt::Argument::fmt::heaffd0233edf3600
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/rt.rs:142:9
   5:          0x8b0d4eb - core::fmt::write::hb97341f5cb67ccb0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/mod.rs:1153:17
   6:          0x8ab5b7e - std::io::Write::write_fmt::h8a8490016174d2ba
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/io/mod.rs:1843:15
   7:          0x8ac0847 - std::sys_common::backtrace::_print::h5e789131039ae7e9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:47:5
   8:          0x8ac0847 - std::sys_common::backtrace::print::h0bc5d6df46ae408b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:34:9
   9:          0x8ac31f9 - std::panicking::default_hook::{{closure}}::h05ad07cb5166223e
  10:          0x8ac2ee6 - std::panicking::default_hook::hfff3b1c5018625b0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:292:9
  11:          0x8ac3a52 - std::panicking::rust_panic_with_hook::h83507dc0eaa37bc0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:779:13
  12:          0x8ac3794 - std::panicking::begin_panic_handler::{{closure}}::h3154d00dc41726bd
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:657:13
  13:          0x8ac0f09 - std::sys_common::backtrace::__rust_end_short_backtrace::h4ee5c8bc09608ebb
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:171:18
  14:          0x8ac34d6 - rust_begin_unwind
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
  15:          0x8b099e5 - core::panicking::panic_fmt::ha0f28a1a40ca05f3
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
  16:          0x8b0a025 - core::result::unwrap_failed::hde36021d185658ce
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
  17:          0x6a87f98 - unwrap<(), alloc::string::String>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1077:23
  18:          0x6a87f98 - enqueue_destroy<illumos_utils::link::VnicDestruction>
                               at /home/bnaecker/omicron/illumos-utils/src/destructor.rs:138:9
  19:          0x6a59a06 - <illumos_utils::link::Link as core::ops::drop::Drop>::drop::h601b9a76135538c1
                               at /home/bnaecker/omicron/illumos-utils/src/link.rs:211:13
  20:          0x6a2d111 - core::ptr::drop_in_place<illumos_utils::link::Link>::he8b90960d7939a47
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  21:          0x6a2f124 - core::ptr::drop_in_place<illumos_utils::running_zone::InstalledZone>::h53852d464b42b944
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  22:          0x4bb6802 - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::setup_propolis_inner::{{closure}}>::h5b3d7fca84e115d5
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1525:5
  23:          0x4bb043e - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::propolis_ensure::{{closure}}>::hf3019c757048f7ee
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1274:63
  24:          0x4baa81e - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::put_state::{{closure}}>::hf804d070d99de893
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1317:44
  25:          0x4c12572 - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::run::{{closure}}>::hc61b5c41a5d4c83d
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:426:60
  26:          0x4c0e5cc - core::ptr::drop_in_place<omicron_sled_agent::instance::Instance::new::{{closure}}>::hacb9fcb48c71738f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:1093:58
  27:          0x4bb8074 - core::ptr::drop_in_place<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>::he586a36c14e8bd96
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  28:          0x4bc4979 - core::ptr::drop_in_place<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>>::h6972bf25892f194b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  29:          0x4bd225d - core::ptr::drop_in_place<tokio::runtime::task::core::Stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>>>::h89e83ebdbb3a6dfb
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  30:          0x4a460bb - tokio::runtime::task::core::Core<T,S>::set_stage::{{closure}}::h58d014d2d12afd27
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:382:41
  31:          0x4a41025 - with_mut<tokio::runtime::task::core::Stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>>, (), tokio::runtime::task::core::{impl#6}::set_stage::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/loom/std/unsafe_cell.rs:16:9
  32:          0x4a41025 - set_stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:382:9
  33:          0x4a18c39 - tokio::runtime::task::core::Core<T,S>::drop_future_or_output::hdee62338aea4cd55
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:347:13
  34:          0x4865aac - tokio::runtime::task::harness::cancel_task::{{closure}}::h9a6b9f54888c12e3
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:453:9
  35:          0x4b94d65 - core::ops::function::FnOnce::call_once::h10be41695e64ea31
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  36:          0x4705fdd - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h9b204a227b389efd
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  37:          0x463380c - do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::cancel_task::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>, ()>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  38:          0x464fe1d - __rust_try
  39:          0x4628cfb - try<(), core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::cancel_task::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  40:          0x4dfe9c9 - std::panic::catch_unwind::h6ea27da2cd8314e8
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  41:          0x4863e41 - tokio::runtime::task::harness::cancel_task::hc360bc29380841c8
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:452:15
  42:          0x48d68cd - tokio::runtime::task::harness::Harness<T,S>::shutdown::h540e1c1e913555c4
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:247:9
  43:          0x4e8da31 - tokio::runtime::task::raw::shutdown::ha3b205ec9e639684
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/raw.rs:311:5
  44:          0x8865693 - tokio::runtime::task::raw::RawTask::shutdown::h4a0fbd8a86a5a760
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/raw.rs:235:18
  45:          0x886f73a - tokio::runtime::task::Task<S>::shutdown::he4d235e4cae4afd5
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/mod.rs:418:9
  46:          0x886ec29 - tokio::runtime::task::list::OwnedTasks<S>::close_and_shutdown_all::h14ade32647a21c8d
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/list.rs:156:25
  47:          0x8861668 - tokio::runtime::scheduler::current_thread::shutdown2::h1e0048d5eac63bc7
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:251:5
  48:          0x88615fe - tokio::runtime::scheduler::current_thread::CurrentThread::shutdown::{{closure}}::h758243cbe5268fbe
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:231:28
  49:          0x886326b - tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}::h3a62a469b06f3a93
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:737:68
  50:          0x8884393 - tokio::runtime::context::scoped::Scoped<T>::set::h0d458a90f21a22fe
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context/scoped.rs:40:9
  51:          0x87fbfcb - tokio::runtime::context::set_scheduler::{{closure}}::h94a26784164a2609
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context.rs:180:26
  52:          0x8837477 - try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, ()), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#0}::shutdown::{closure_env#1}, ()>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, ())>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/local.rs:284:16
  53:          0x8836119 - std::thread::local::LocalKey<T>::with::h7f9bf1e31a9375c6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/local.rs:260:9
  54:          0x87fbf53 - tokio::runtime::context::set_scheduler::ha78e676b57ccc59e
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context.rs:180:9
  55:          0x88630bc - tokio::runtime::scheduler::current_thread::CoreGuard::enter::hea97085786c6af59
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:737:27
  56:          0x88612f0 - shutdown
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:230:13
  57:          0x885305f - <tokio::runtime::runtime::Runtime as core::ops::drop::Drop>::drop::h7f330532839426ac
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/runtime.rs:474:17
  58:          0x881fc11 - core::ptr::drop_in_place<tokio::runtime::runtime::Runtime>::h8ea59b5b85e49358
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  59:          0x4854d8b - test_instance_create_timeout_while_creating_zone
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:2126:36
  60:          0x4813405 - omicron_sled_agent::instance::tests::test_instance_create_timeout_while_creating_zone::{{closure}}::hb903295bffff8dbb
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:2046:64
  61:          0x4b98f48 - core::ops::function::FnOnce::call_once::h496c7e13d3b41490
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  62:          0x5093b42 - core::ops::function::FnOnce::call_once::h890207b757df55e1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  63:          0x5093b42 - test::__rust_begin_short_backtrace::h7d2d4369bdaf4a34
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:621:18
  64:          0x50931d2 - test::run_test_in_process::{{closure}}::h31000fd7644e6992
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:644:60
  65:          0x50931d2 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hb3f7ea42d8a6d9c4
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  66:          0x50931d2 - std::panicking::try::do_call::hf3ecd0821fa0dae8
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  67:          0x50931d2 - std::panicking::try::hd6efc5a7a7ffbace
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  68:          0x50931d2 - std::panic::catch_unwind::hb9d904bb32ff8f59
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  69:          0x50931d2 - test::run_test_in_process::hd3f565c458c41dbd
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:644:27
  70:          0x50931d2 - test::run_test::{{closure}}::hb015f67eee864c44
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:567:43
  71:          0x505952d - test::run_test::{{closure}}::he615d4cad0e3c9f0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:595:41
  72:          0x505952d - std::sys_common::backtrace::__rust_begin_short_backtrace::hfe7d7167648614df
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  73:          0x505e347 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::he3b5c3f6fe28d7e1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/mod.rs:528:17
  74:          0x505e347 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h27dbbbac419554a7
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  75:          0x505e347 - std::panicking::try::do_call::h9d9ab8c864f6fe75
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  76:          0x505e347 - std::panicking::try::h6bfedca0d6946fa1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  77:          0x505e347 - std::panic::catch_unwind::h39bfdc24dcdfd6ee
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  78:          0x505e347 - std::thread::Builder::spawn_unchecked_::{{closure}}::h01f6b1fa8623c4c0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/mod.rs:527:30
  79:          0x505e347 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3a34ecbf74f886bc
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  80:          0x8acc0a9 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd414fd500219c787
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/boxed.rs:2020:9
  81:          0x8acc0a9 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h0185933a0a42e8f6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/boxed.rs:2020:9
  82:          0x8acc0a9 - std::sys::pal::unix::thread::Thread::new::thread_start::h5de3317a74146726
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys/pal/unix/thread.rs:108:17
  83: 0xfffffc7fee534287 - _thrp_setup
  84: 0xfffffc7fee5345d0 - <unknown>
test instance::tests::test_instance_create_timeout_while_creating_zone ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 78 filtered out; finished in 1.62s

        PASS [   1.671s] omicron-sled-agent instance::tests::test_instance_create_timeout_while_creating_zone
------------
     Summary [   1.674s] 1 test run: 1 passed, 1373 skipped
bnaecker@shale : ~/omicron $

So the test passed, but a call to cleanup some test resources panicked, I think. That panic is here:

pub fn enqueue_destroy(&self, object: T) {
self.inner
.lock()
.unwrap()
.tx
.send(Message::Data(object))
.map_err(|err| err.to_string())
// Unwrap safety:
// - The DestructorWorker can only be stopped by "try_close"
// - "try_close" can only succeed if it's called on the last strong reference to the
// Destructor.
// - Since "enqueue_destroy" was called on a "Destructor", one more reference must
// exist.
.unwrap();
}

That comment about unwrap safety seems to imply this is never expected to panic. There is a related set of comments here:

// Unwrap safety: the handle must be "Some", so we only get here if the
// DestructorWorker is still running.
//
// This relies on the assumption that "Exit" is the only way to stop the
// worker.

I don't quite see how these are related to the main issue at hand, but it does seem surprising that these have panicked at all. I have a few questions about the unwrap safety and other pieces of that code, though I'd rather not tug on that thread just yet.

@bnaecker
Copy link
Collaborator Author

Ok, well I think I finally figured out why this was not panicking before. Although we set up an expectation here:

boot_ctx.expect().return_once(move |_| {
rt_handle.block_on(tokio::time::sleep(TIMEOUT_DURATION * 2));
Ok(())
});

We never indicate how many times it must be called! The other calls use times(..) to allow any number of calls. (I think that's actually wrong, and we should require at least one call, but that's separate.) If we instead add a .times(1) to that expectation setup, we the test fail on main because the MockZones::boot() call is never issued.

There are at least two things I still don't understand:

  1. Why did my commit introduce any change to this behavior?
  2. Why is there a panic in the test on main (the unwrap() on a closed channel), and yet cargo nextest indicates that the test passes? That seems incorrect.

@bnaecker
Copy link
Collaborator Author

bnaecker commented Jul 29, 2024

Ok, I have a theory for what's going on. I believe it explains both why we never saw the Tokio panic when calling block_on() twice, and why these changes reveal that. The relevant change is the removal of the call to resolve Nexus inside Instance::setup_propolis_inner().

Previously, that attempted to resolve Nexus inside that method; and then provided that address to Propolis as an SMF property. Propolis then passed this on to the oximeter_producer::Server, which then used the address to register as a metric producer.

One of the main changes in this PR is that that's no longer needed. Instead, we let the producer server always use DNS internally. That uses the underlay address we create here and provide to Propolis to create a DNS resolver internally, and always re-resolves Nexus at each call to renew its registration.

So, why does this matter? Well, the whole point of this test is to see what happens when booting the Propolis zone times out. It did previously time out, but not by calling the MockZones::boot() expectation we setup. It timed out in DNS resolution finding Nexus!

That explains why we did not see the panic about block_on(). That method was never reached, as I proved by showing that the test fails differently if we set an expectation that the MockZones::boot() method is called exactly once. It also explains why this test fails: booting now proceeds relatively quickly into the MockZones::boot() method, and boom, we hit the Tokio panic.

Now, how to fix this? I think the best thing would be to fix the test itself. That wasn't exactly doing what it thought it was, but it has a purpose. I am going to try to make the test work more clearly, having the MockZones::boot() method take a while.

@bnaecker
Copy link
Collaborator Author

bnaecker commented Jul 30, 2024

Alright, I think this is fixed in d1ddb22. I removed the offending call to block_on() and replaced it with a small sleep-loop in the test, which blocks instance boot but doesn't panic the runtime.

I still think there's something to investigate with the failures we continue to see, and continue to be swallowed by the test harness, like this:

bnaecker@shale : ~/omicron $ RUST_BACKTRACE=full cargo nextest run test_instance_create_timeout_while_creating_zone --nocapture
info: experimental features enabled: setup-scripts
    Blocking waiting for file lock on build directory
   Compiling omicron-sled-agent v0.1.0 (/home/bnaecker/omicron/sled-agent)
   Compiling nexus-test-utils v0.1.0 (/home/bnaecker/omicron/nexus/test-utils)
   Compiling zone-setup v0.1.0 (/home/bnaecker/omicron/zone-setup)
   Compiling nexus-inventory v0.1.0 (/home/bnaecker/omicron/nexus/inventory)
   Compiling omicron-omdb v0.1.0 (/home/bnaecker/omicron/dev-tools/omdb)
   Compiling omicron-nexus v0.1.0 (/home/bnaecker/omicron/nexus)
   Compiling omicron-dev v0.1.0 (/home/bnaecker/omicron/dev-tools/omicron-dev)
   Compiling reconfigurator-cli v0.1.0 (/home/bnaecker/omicron/dev-tools/reconfigurator-cli)
   Compiling omicron-cockroach-admin v0.1.0 (/home/bnaecker/omicron/cockroach-admin)
   Compiling nexus-reconfigurator-execution v0.1.0 (/home/bnaecker/omicron/nexus/reconfigurator/execution)
   Compiling nexus-db-queries v0.1.0 (/home/bnaecker/omicron/nexus/db-queries)
   Compiling nexus-saga-recovery v0.1.0 (/home/bnaecker/omicron/nexus/saga-recovery)
   Compiling nexus-metrics-producer-gc v0.1.0 (/home/bnaecker/omicron/nexus/metrics-producer-gc)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1m 08s
    Starting 1 test across 140 binaries (1373 skipped; run ID: 37851df6-ae33-4011-b05e-c98469a397e8, nextest profile: default)
       START             omicron-sled-agent instance::tests::test_instance_create_timeout_while_creating_zone

running 1 test
log file: /tmp/omicron_sled_agent-afd58ef27f7ca5f5-test_instance_create_timeout_while_creating_zone.21783.0.log
note: configured to log to "/tmp/omicron_sled_agent-afd58ef27f7ca5f5-test_instance_create_timeout_while_creating_zone.21783.0.log"
Terminating StorageManagerTestHarness
Destroying pool: ZpoolName { id: 980465a3-bb06-44cb-84df-1b32024e9a0c (zpool), kind: Internal }
Destroying pool: ZpoolName { id: e89bf9a7-353d-4c1c-b026-b9fcf880f5f5 (zpool), kind: External }
thread 'instance::tests::test_instance_create_timeout_while_creating_zone' panicked at illumos-utils/src/destructor.rs:150:14:
called `Result::unwrap()` on an `Err` value: "channel closed"
stack backtrace:
   0:          0x8ac2983 - std::backtrace_rs::backtrace::libunwind::trace::h038277e88fb32018
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:          0x8ac2983 - std::backtrace_rs::backtrace::trace_unsynchronized::hee8bce1f0364ec29
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x8ac2983 - std::sys_common::backtrace::_print_fmt::h862a0ca96496f39c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:68:5
   3:          0x8ac2983 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h51a55916cbfff243
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:44:22
   4:          0x8b0f42b - core::fmt::rt::Argument::fmt::heaffd0233edf3600
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/rt.rs:142:9
   5:          0x8b0f42b - core::fmt::write::hb97341f5cb67ccb0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/mod.rs:1153:17
   6:          0x8ab7abe - std::io::Write::write_fmt::h8a8490016174d2ba
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/io/mod.rs:1843:15
   7:          0x8ac2787 - std::sys_common::backtrace::_print::h5e789131039ae7e9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:47:5
   8:          0x8ac2787 - std::sys_common::backtrace::print::h0bc5d6df46ae408b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:34:9
   9:          0x8ac5139 - std::panicking::default_hook::{{closure}}::h05ad07cb5166223e
  10:          0x8ac4e26 - std::panicking::default_hook::hfff3b1c5018625b0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:292:9
  11:          0x8ac5992 - std::panicking::rust_panic_with_hook::h83507dc0eaa37bc0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:779:13
  12:          0x8ac56d4 - std::panicking::begin_panic_handler::{{closure}}::h3154d00dc41726bd
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:657:13
  13:          0x8ac2e49 - std::sys_common::backtrace::__rust_end_short_backtrace::h4ee5c8bc09608ebb
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:171:18
  14:          0x8ac5416 - rust_begin_unwind
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
  15:          0x8b0b925 - core::panicking::panic_fmt::ha0f28a1a40ca05f3
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
  16:          0x8b0bf65 - core::result::unwrap_failed::hde36021d185658ce
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
  17:          0x6b88dc8 - unwrap<(), alloc::string::String>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1077:23
  18:          0x6b88dc8 - enqueue_destroy<illumos_utils::link::VnicDestruction>
                               at /home/bnaecker/omicron/illumos-utils/src/destructor.rs:138:9
  19:          0x6b09ba6 - <illumos_utils::link::Link as core::ops::drop::Drop>::drop::h27d33cc49e163b61
                               at /home/bnaecker/omicron/illumos-utils/src/link.rs:211:13
  20:          0x6add2b1 - core::ptr::drop_in_place<illumos_utils::link::Link>::h2ba3ae93add6aef9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  21:          0x6adf2c4 - core::ptr::drop_in_place<illumos_utils::running_zone::InstalledZone>::h6adb9dc5d1624faf
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  22:          0x4dfbcc5 - core::ptr::drop_in_place<illumos_utils::running_zone::RunningZone>::h3a56a8660ce97a8d
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  23:          0x4db49ff - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::setup_propolis_inner::{{closure}}>::hc78b81dc91df0529
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1516:5
  24:          0x4dae6ce - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::propolis_ensure::{{closure}}>::h33815d189658c152
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1285:63
  25:          0x4da8aae - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::put_state::{{closure}}>::he90f45bedcf7b7ae
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:1328:44
  26:          0x4e10842 - core::ptr::drop_in_place<omicron_sled_agent::instance::InstanceRunner::run::{{closure}}>::ha154260768bf598f
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:425:60
  27:          0x4e0c94c - core::ptr::drop_in_place<omicron_sled_agent::instance::Instance::new::{{closure}}>::hc918515983c6c8d6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:1104:58
  28:          0x4db62e4 - core::ptr::drop_in_place<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>::h55d166a5f103ebfe
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  29:          0x4dc2cc9 - core::ptr::drop_in_place<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>>::h6ed400e716f5466d
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  30:          0x4dd059d - core::ptr::drop_in_place<tokio::runtime::task::core::Stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::Instance::new::{{closure}}>>>>::he1e39ea8825afda0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  31:          0x4c8f79b - tokio::runtime::task::core::Core<T,S>::set_stage::{{closure}}::hec0b266c78c2c938
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:382:41
  32:          0x4c836d5 - with_mut<tokio::runtime::task::core::Stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>>, (), tokio::runtime::task::core::{impl#6}::set_stage::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/loom/std/unsafe_cell.rs:16:9
  33:          0x4c836d5 - set_stage<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:382:9
  34:          0x4c5b229 - tokio::runtime::task::core::Core<T,S>::drop_future_or_output::h61b19ac0da50e4f6
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:347:13
  35:          0x4b0affc - tokio::runtime::task::harness::cancel_task::{{closure}}::h634f3459b079ee03
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:453:9
  36:          0x4d9e885 - core::ops::function::FnOnce::call_once::ha9d5899d56169020
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  37:          0x4f7272d - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hfc3033f427757c74
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  38:          0x48544bc - do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::cancel_task::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>, ()>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  39:          0x486b94d - __rust_try
  40:          0x483aabb - try<(), core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::cancel_task::{closure_env#0}<core::pin::Pin<alloc::boxed::Box<omicron_sled_agent::instance::{impl#2}::new::{async_block_env#2}, alloc::alloc::Global>>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle, alloc::alloc::Global>>>>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  41:          0x4890b09 - std::panic::catch_unwind::h4117c687f9cef475
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  42:          0x4b09921 - tokio::runtime::task::harness::cancel_task::hcef6c49fd16fbad7
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:452:15
  43:          0x4b7e01d - tokio::runtime::task::harness::Harness<T,S>::shutdown::he99ebb5e9c9d4ef7
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:247:9
  44:          0x4e8c661 - tokio::runtime::task::raw::shutdown::h3dba1f91da059567
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/raw.rs:311:5
  45:          0x88675d3 - tokio::runtime::task::raw::RawTask::shutdown::h4a0fbd8a86a5a760
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/raw.rs:235:18
  46:          0x887167a - tokio::runtime::task::Task<S>::shutdown::he4d235e4cae4afd5
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/mod.rs:418:9
  47:          0x8870b69 - tokio::runtime::task::list::OwnedTasks<S>::close_and_shutdown_all::h14ade32647a21c8d
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/list.rs:156:25
  48:          0x88635a8 - tokio::runtime::scheduler::current_thread::shutdown2::h1e0048d5eac63bc7
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:251:5
  49:          0x886353e - tokio::runtime::scheduler::current_thread::CurrentThread::shutdown::{{closure}}::h758243cbe5268fbe
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:231:28
  50:          0x88651ab - tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}::h3a62a469b06f3a93
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:737:68
  51:          0x88862d3 - tokio::runtime::context::scoped::Scoped<T>::set::h0d458a90f21a22fe
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context/scoped.rs:40:9
  52:          0x87fdf0b - tokio::runtime::context::set_scheduler::{{closure}}::h94a26784164a2609
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context.rs:180:26
  53:          0x88393b7 - try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, ()), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#0}::shutdown::{closure_env#1}, ()>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, ())>
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/local.rs:284:16
  54:          0x8838059 - std::thread::local::LocalKey<T>::with::h7f9bf1e31a9375c6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/local.rs:260:9
  55:          0x87fde93 - tokio::runtime::context::set_scheduler::ha78e676b57ccc59e
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/context.rs:180:9
  56:          0x8864ffc - tokio::runtime::scheduler::current_thread::CoreGuard::enter::hea97085786c6af59
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:737:27
  57:          0x8863230 - shutdown
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/scheduler/current_thread/mod.rs:230:13
  58:          0x8854f9f - <tokio::runtime::runtime::Runtime as core::ops::drop::Drop>::drop::h7f330532839426ac
                               at /home/bnaecker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/runtime.rs:474:17
  59:          0x8821b51 - core::ptr::drop_in_place<tokio::runtime::runtime::Runtime>::h8ea59b5b85e49358
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  60:          0x48d6f3b - test_instance_create_timeout_while_creating_zone
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:2138:36
  61:          0x4b86ba5 - omicron_sled_agent::instance::tests::test_instance_create_timeout_while_creating_zone::{{closure}}::h74160ca66dbabff8
                               at /home/bnaecker/omicron/sled-agent/src/instance.rs:2047:64
  62:          0x4da0c48 - core::ops::function::FnOnce::call_once::hc02ec1ca17d2556c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  63:          0x5091862 - core::ops::function::FnOnce::call_once::h890207b757df55e1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  64:          0x5091862 - test::__rust_begin_short_backtrace::h7d2d4369bdaf4a34
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:621:18
  65:          0x5090ef2 - test::run_test_in_process::{{closure}}::h31000fd7644e6992
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:644:60
  66:          0x5090ef2 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hb3f7ea42d8a6d9c4
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  67:          0x5090ef2 - std::panicking::try::do_call::hf3ecd0821fa0dae8
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  68:          0x5090ef2 - std::panicking::try::hd6efc5a7a7ffbace
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  69:          0x5090ef2 - std::panic::catch_unwind::hb9d904bb32ff8f59
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  70:          0x5090ef2 - test::run_test_in_process::hd3f565c458c41dbd
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:644:27
  71:          0x5090ef2 - test::run_test::{{closure}}::hb015f67eee864c44
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:567:43
  72:          0x505724d - test::run_test::{{closure}}::he615d4cad0e3c9f0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/test/src/lib.rs:595:41
  73:          0x505724d - std::sys_common::backtrace::__rust_begin_short_backtrace::hfe7d7167648614df
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  74:          0x505c067 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::he3b5c3f6fe28d7e1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/mod.rs:528:17
  75:          0x505c067 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h27dbbbac419554a7
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
  76:          0x505c067 - std::panicking::try::do_call::h9d9ab8c864f6fe75
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  77:          0x505c067 - std::panicking::try::h6bfedca0d6946fa1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  78:          0x505c067 - std::panic::catch_unwind::h39bfdc24dcdfd6ee
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  79:          0x505c067 - std::thread::Builder::spawn_unchecked_::{{closure}}::h01f6b1fa8623c4c0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/mod.rs:527:30
  80:          0x505c067 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3a34ecbf74f886bc
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  81:          0x8acdfe9 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd414fd500219c787
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/boxed.rs:2020:9
  82:          0x8acdfe9 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h0185933a0a42e8f6
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/boxed.rs:2020:9
  83:          0x8acdfe9 - std::sys::pal::unix::thread::Thread::new::thread_start::h5de3317a74146726
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys/pal/unix/thread.rs:108:17
  84: 0xfffffc7fedf64287 - _thrp_setup
  85: 0xfffffc7fedf645d0 - <unknown>
test instance::tests::test_instance_create_timeout_while_creating_zone ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 78 filtered out; finished in 2.47s

        PASS [   2.547s] omicron-sled-agent instance::tests::test_instance_create_timeout_while_creating_zone
------------
     Summary [   2.557s] 1 test run: 1 passed, 1373 skipped

That's a panic on a closed channel during the VnicDestruction drop implementation, but I understand neither why that's happening; whether it should happen; nor why the test harness is not raising that error.

@bnaecker bnaecker requested a review from smklein July 30, 2024 00:24
@smklein smklein requested a review from lifning July 30, 2024 01:18
@smklein
Copy link
Collaborator

smklein commented Jul 30, 2024

Adding @lifning as an FYI, because I think this test was added by #4691

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me, thanks for all the investigation!

@lifning , based on my reading of the comments, it seems like the test_instance_create_timeout_while_creating_zone test might still be broken, for similar reasons?

@bnaecker bnaecker merged commit b74d691 into main Jul 30, 2024
24 checks passed
@bnaecker bnaecker deleted the update-you-a-propolis branch July 30, 2024 16:37
@lifning
Copy link
Contributor

lifning commented Jul 31, 2024

Started looking into that this morning. Frustratingly enough, i got that "channel closed" traceback to repro on my local illumos machine exactly once -- the first time I ran a full cargo nextest run today after pulling latest main -- and then it never recurred again after a handful of full test suite passes and hundreds of individual runs of test_instance_create_timeout_while_creating_zone in particular (including directly hitting the cargo test-flavored binary outside of the nextest runner, on the off chance that was related at all).

Here's my notes from trying to work out how this could be happening.

The comment in Destructor::enqueue_destroy that says "The DestructorWorker can only be stopped by try_close" is perhaps not 100% true today: There is another place a Message::Exit is instantiated in omicron, in the line msg.unwrap_or(Message::Exit) in the DestructorWorker::run loop itself. In fact, at time of writing this seems to be the only way a Message::Exit is instantiated outside of Destructor's own unit tests, because a grep shows that those are the only invocations of try_close any more.

But then, that means it's not that much of a clue, is it? In fact, it's even more confusing (and the unwrap should have even more safety): If the failure is from the transmit side of the channel complaining that the channel was closed, and the only way for the channel to be closed is for that very channel to have already been closed when DestructorWorker tried to receive from it, meaning there was no longer a transmit-side to be doing the complaining...

In this test, the VnicAllocator is ultimately used to create a control vnic for the fake instance zone:

  • Passed by InstanceRunner::setup_propolis_inner's call to ZoneBuilder::with_underlay_vnic_allocator in sled-agent/src/instance.rs
  • ZoneBuilder::fake_install invokes VnicAllocator::new_control to create a Link (with a reference to the Destructor) in running_zone.rs, passed back to setup_propolis_inner via the control_vnic field of the returned InstalledZone.
  • In the successful case, RunningZone::boot wraps the InstalledZone in a RunningZone struct, which gets further wrapped by a PropolisSetup struct, and the Link would still be extant there.

However, in the timeout case set up by this test, RunningZone::boot calls Zones::boot and gets stuck in our timeout-inducing sleep loop, from which it eventually returns, but by the time it does, our test should be finished.

Is it possible that the DestructorWorker::run task is being canceled after the test body returns? I haven't yet looked into whether tokio::test would do that, but I have started to run out of other plausible explanations.

@lifning
Copy link
Contributor

lifning commented Aug 2, 2024

Something mkeeter pointed out: Crucible has seen a similar issue - https://github.com/oxidecomputer/crucible/blob/main/upstairs/src/client.rs#L2308-L2319 - with the apparent takeaway being that Tokio's behavior of terminating tasks in arbitrary orders can just do that sometimes. (test or not?)

In light of that, it may be reasonable to just not .unwrap() there (or on channel sends/receives in general, I suppose).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants